Action recognition datasets are expensive to build and annotate. Unlike image annotation, annotating videos for action recognition and detection is time-consuming. A typical trick for achieving high recognition accuracy on such small datasets is pretraining a neural network on unlabelled data. The pretrained network can later be fine-tuned on a small labelled dataset.
A video is a sequence of temporally related frames. Given a frame sequence, one can verify its sequential order without any prior action label. This idea is exploited in multiple papers, such as Shuffle & Learn, Odd one out, and this paper, the Order Prediction Network.
An unsupervised approach formulates frame sequence verification as a classification problem. Shuffle & Learn poses it as binary classification, while Odd one out and the Order Prediction Network pose it as multi-class classification. This paper proposes the Order Prediction Network (OPN) to identify the correct order of a frame sequence. Given a sequence of four frames, there are 4! = 24 possible permutations. Since some actions are coherent both forward and backward (e.g., opening/closing a door), each permutation is merged with its reverse, reducing the classes to 4!/2 = 12.
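The forward/backward merging can be sketched in a few lines: enumerate all orderings and keep one canonical representative per {permutation, reverse} pair.

```python
from itertools import permutations

def order_classes(n_frames):
    """Enumerate frame orderings, merging each permutation with its
    reverse (forward/backward ambiguity) into a single class."""
    classes = set()
    for p in permutations(range(n_frames)):
        # Canonical representative: lexicographically smaller of p and its reverse.
        classes.add(min(p, p[::-1]))
    return sorted(classes)

print(len(order_classes(4)))  # → 12 classes out of 4! = 24 permutations
```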
OPN is a Siamese network fed with RGB frames; the data sampling process is illustrated below. AlexNet is used to learn RGB image features (f6). While other papers concatenate the learned features directly before the classification layer, the authors argue that “taking one glimpse at all frames” may not capture the concept of ordering well. Thus, before the classification layer, pairwise feature extraction is performed. These pairwise features are finally concatenated and fed to the order prediction layer, a multi-class classifier.
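A minimal sketch of the pairwise-feature idea, with toy linear maps standing in for the shared AlexNet trunk and the per-pair transform (the dimensions and the `frame_feature`/`opn_features` helpers are illustrative assumptions, not the paper's actual sizes):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def frame_feature(frame, w):
    """Stand-in for the shared AlexNet trunk producing an f6 embedding.
    (Hypothetical linear + ReLU map; the real trunk is a convnet.)"""
    return np.maximum(frame @ w, 0)

def opn_features(frames, w_trunk, w_pair):
    """Embed each frame with shared weights, transform every
    concatenated pair, then concatenate all pair features for the
    final order-prediction classifier."""
    f6 = [frame_feature(f, w_trunk) for f in frames]
    pair_feats = []
    for i, j in combinations(range(len(f6)), 2):
        pair = np.concatenate([f6[i], f6[j]])
        pair_feats.append(np.maximum(pair @ w_pair, 0))
    return np.concatenate(pair_feats)

# Four frames with toy dimensions: 64-d input, 32-d f6, 16-d pair feature.
frames = [rng.normal(size=64) for _ in range(4)]
w_trunk = rng.normal(size=(64, 32))
w_pair = rng.normal(size=(64, 16))
feat = opn_features(frames, w_trunk, w_pair)
print(feat.shape)  # → (96,) = 4C2 pairs * 16 dims each
```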
IMHO, this approach suffers from scalability issues. While Odd one out (O3N) feeds 15 frames into a single network, OPN uses four Siamese branches for four frames and then computes 4-choose-2 (4C2) pairwise features, a count that grows quadratically with the number of frames; worse, the number of order classes grows factorially (n!/2). Yet, to be fair, OPN beats O3N according to their paper.
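The growth rates are easy to check with the standard library: pairwise branches scale as C(n, 2) while order classes scale as n!/2.

```python
from math import comb, factorial

# Pairwise branches for n frames: C(n, 2) = n(n-1)/2 (quadratic).
# Order classes after merging reverses: n!/2 (factorial).
for n in (4, 6, 8):
    print(n, comb(n, 2), factorial(n) // 2)
# → 4 6 12
# → 6 15 360
# → 8 28 20160
```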
The frame sampling process is divided into three phases. The first phase promotes motion-aware sampling: instead of random sampling, frames are drawn from high-motion windows, similar to the Shuffle & Learn paper.
After sampling frames, spatial jittering and channel splitting are applied. Spatial jittering means extracting a random patch from each frame; in this paper, an 80x80 patch is randomly cropped from each 224x224 frame. Channel splitting means randomly choosing one color channel and duplicating its values into the other two channels, analogous to grayscale conversion.
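Both augmentations are a few lines of numpy. This is a minimal sketch of the two operations as described, not the paper's actual preprocessing code:

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_jitter(frame, size=80):
    """Crop a random size x size patch; applied independently per
    frame, so patches come from different locations."""
    h, w, _ = frame.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return frame[y:y + size, x:x + size]

def channel_split(frame):
    """Pick one color channel at random and duplicate it across all
    three channels, analogous to grayscale conversion."""
    c = rng.integers(0, 3)
    return np.repeat(frame[:, :, c:c + 1], 3, axis=2)

frame = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
patch = channel_split(spatial_jitter(frame))
print(patch.shape)  # → (80, 80, 3), all three channels identical
```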
Once the unsupervised network has learned a useful embedding, it can be fine-tuned on a small labelled dataset. To do so, a classification layer is set up on top of the embedding layer. All approaches initialize the supervised network with the unsupervised pretrained weights. Some approaches train only the classification layer on top of the embedding layer; others train the whole network, i.e., resume training end to end.
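The "train the classification layer only" strategy can be sketched as logistic regression on frozen features. Everything here is a toy stand-in: a random linear + ReLU map plays the pretrained embedding, and the labels are synthetic ones that are linearly decodable from those frozen features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the pretrained unsupervised trunk (kept frozen below).
w_embed = rng.normal(size=(64, 32)) / np.sqrt(64)

def embed(x):
    return np.maximum(x @ w_embed, 0)

def train_head(x, y, n_classes, lr=0.1, steps=200):
    """Train only a new softmax classification layer on top of the
    frozen embedding (full-batch gradient descent)."""
    w = np.zeros((32, n_classes))
    z = embed(x)                           # frozen features, computed once
    for _ in range(steps):
        logits = z @ w
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        p[np.arange(len(y)), y] -= 1       # softmax cross-entropy gradient
        w -= lr * z.T @ p / len(y)
    return w

# Synthetic labelled set: labels are a linear function of the embedding,
# so a linear head suffices (an assumption made for the demo).
x = rng.normal(size=(100, 64))
y = (embed(x) @ rng.normal(size=32) > 0).astype(int)
w_head = train_head(x, y, n_classes=2)
acc = ((embed(x) @ w_head).argmax(1) == y).mean()
```

Resuming training of the whole network would additionally update `w_embed`, usually with a smaller learning rate for the pretrained layers.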