This paper proposes an end-to-end neural network for video retrieval. Given a query video, the network's embedding is used to retrieve similar videos from a database. The main idea is simple: a neural network is trained to classify and rank videos using a two-term loss function. The first term is a supervised classification term, a typical cross-entropy loss for multi-class classification that uses a one-hot vector to indicate the video class.
The second term is a triplet loss, which is commonly used for ranking pairs of images or videos. In triplet-loss networks, triplets (X, X+, X-) are fed into the network. The triplet loss encourages a smaller distance between the embeddings of a positive pair and a larger distance between the embeddings of a negative pair. That is, if X and X+ belong to the same class while X and X- belong to different classes, then |F(X) - F(X+)| < |F(X) - F(X-)|.
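The inequality above is usually enforced with a hinge on the distance gap. The following is a minimal sketch of that idea, not the paper's implementation: the Euclidean distance, the `margin` parameter, and the function name `triplet_loss` are all assumptions made for illustration.

```python
import numpy as np

def triplet_loss(f_anchor, f_pos, f_neg, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    f_anchor, f_pos, f_neg: arrays of shape (batch, dim) holding F(X), F(X+), F(X-).
    margin: assumed hyper-parameter; the review does not state the paper's value.
    """
    d_pos = np.linalg.norm(f_anchor - f_pos, axis=-1)  # |F(X) - F(X+)|
    d_neg = np.linalg.norm(f_anchor - f_neg, axis=-1)  # |F(X) - F(X-)|
    # The loss is zero once the negative is farther than the positive by at least the margin.
    return np.maximum(0.0, d_pos - d_neg + margin).mean()
```

A triplet that already satisfies the inequality by more than the margin contributes nothing to the loss, so training concentrates on the triplets that violate it.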
The main objective of this paper is video retrieval for similar actions. Thus, clips X and X+ are sampled from the same action class, while X and X- are sampled from different action classes. The classification loss term guides the network toward better action classification of clips, while the triplet loss term enforces similar embeddings within the same action class.
To encode a clip of n frames, a Siamese deep convolutional neural network extracts a feature representation of each input frame. To keep the network simple, a video representation is created by fusing the frame features using a weighted average. The first fully connected layer is followed by a sigmoid layer designed to learn similarity-preserving binary-like codes; this layer acts like an n-bit hashing function. The second fully connected layer has k nodes, where k is the number of categories. Finally, the loss function combines the classification and triplet loss terms.
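The head of the network described above can be sketched roughly as follows. This is only an illustration under assumptions: the per-frame CNN features are taken as given, and the names `encode_clip`, `frame_weights`, `W_hash`, and `W_cls` are hypothetical rather than taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_clip(frame_feats, frame_weights, W_hash, b_hash, W_cls, b_cls):
    """Fuse frame features and apply the two fully connected layers.

    frame_feats:   (n_frames, d) features from the per-frame CNN (assumed given).
    frame_weights: (n_frames,) fusion weights, normalized here to sum to 1.
    W_hash, b_hash: first fully connected layer, (n_bits, d) and (n_bits,).
    W_cls, b_cls:   second fully connected layer, (k, n_bits) and (k,).
    """
    w = frame_weights / frame_weights.sum()
    clip_feat = w @ frame_feats                   # weighted average over frames, shape (d,)
    code = sigmoid(W_hash @ clip_feat + b_hash)   # binary-like code in (0, 1), shape (n_bits,)
    logits = W_cls @ code + b_cls                 # class scores, shape (k,)
    return code, logits
```

The `code` output feeds the triplet term and later the binarization step, while `logits` feeds the classification term.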
In the retrieval phase, binary hash codes are generated by binarization, which maps the binary-like outputs to 0 or 1. An exclusive-or (XOR) operation is performed between the binary code of the query video and the codes of the videos stored in the database. This yields the pairwise Hamming distances, from which the video with the highest similarity is found.
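The retrieval step can be illustrated with a short sketch. The 0.5 binarization threshold and the function names are assumptions; the paper may use a different rule.

```python
import numpy as np

def binarize(codes, threshold=0.5):
    """Map binary-like sigmoid outputs to {0, 1}; the threshold is an assumed value."""
    return (np.asarray(codes) >= threshold).astype(np.uint8)

def hamming_distances(query_bits, db_bits):
    """XOR the query against every database code and count the differing bits."""
    return np.bitwise_xor(db_bits, query_bits).sum(axis=1)

def retrieve(query_code, db_codes, top_k=1):
    """Return indices and Hamming distances of the top_k closest database videos."""
    q = binarize(query_code)
    db = binarize(db_codes)
    dists = hamming_distances(q, db)
    order = np.argsort(dists, kind="stable")[:top_k]
    return order, dists[order]
```

The smallest Hamming distance corresponds to the highest similarity, so the first returned index is the best match.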
The proposed approach is evaluated on HMDB51 and UCF101 using different hash code lengths (64, 128, 256, and 512 bits). As expected, more bits result in higher accuracy.
Temporal fusion of frame features is done using a weighted average. While this is trivial to implement, better approaches exist. 3D convolution and 3D pooling are simple and better capture the temporal relations between clip frames.
Optical flow, stacks of differences, and dynamic images are alternative modalities for action clips. Since this paper targets video actions, these modalities should have been considered.
While the first fully connected layer acts like a hashing function, a pairwise XOR operation is still required between the query video and every video in the dataset. This can be expensive. Locality-sensitive hashing is a natural extension to reduce the computational cost of the retrieval phase.
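One way this suggestion could be realized is bit-sampling locality-sensitive hashing over the binary codes. The sketch below is one possible variant, not something described in the paper; the class name and the parameters `n_tables` and `bits_per_key` are hypothetical. Each table hashes a random subset of bit positions, so only database videos that share a bucket with the query are compared by exact Hamming distance.

```python
import numpy as np
from collections import defaultdict

class BitSamplingLSH:
    """Bit-sampling LSH for Hamming space (illustrative sketch, assumed parameters)."""

    def __init__(self, n_bits, n_tables=8, bits_per_key=16, seed=0):
        rng = np.random.default_rng(seed)
        # Each table looks at its own random subset of bit positions.
        self.positions = [
            rng.choice(n_bits, size=bits_per_key, replace=False)
            for _ in range(n_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.db_bits = None

    @staticmethod
    def _key(bits, pos):
        return bits[pos].tobytes()

    def index(self, db_bits):
        """Insert every database code into the bucket it hashes to in each table."""
        self.db_bits = np.asarray(db_bits, dtype=np.uint8)
        for i, bits in enumerate(self.db_bits):
            for pos, table in zip(self.positions, self.tables):
                table[self._key(bits, pos)].append(i)

    def query(self, q_bits, top_k=1):
        """Collect bucket collisions, then rank only those candidates by Hamming distance."""
        q = np.asarray(q_bits, dtype=np.uint8)
        candidates = set()
        for pos, table in zip(self.positions, self.tables):
            candidates.update(table.get(self._key(q, pos), []))
        if not candidates:
            return np.array([], dtype=int)
        cand = np.array(sorted(candidates))
        dists = np.bitwise_xor(self.db_bits[cand], q).sum(axis=1)
        return cand[np.argsort(dists, kind="stable")[:top_k]]
```

The search is approximate: a true nearest neighbor that shares no bucket with the query is missed, and more tables or shorter keys trade speed for recall.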
As mentioned previously, the loss function contains two terms: a supervised classification term plus a triplet loss term. These terms are weighted using two hyper-parameters, alpha and beta. It is a minor issue, but the selection and tuning of these hyper-parameters is omitted from the current paper.
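For concreteness, the weighted objective can be sketched as below. The default values of `alpha` and `beta` are placeholders, since the paper does not report them, and the triplet term is passed in as an already computed scalar; the function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example with an integer class label."""
    z = logits - np.max(logits)               # shift for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))
    return -log_probs[label]

def total_loss(logits, label, triplet_term, alpha=1.0, beta=1.0):
    """alpha * classification loss + beta * triplet loss (weights are placeholders)."""
    return alpha * cross_entropy(logits, label) + beta * triplet_term
```

Reporting how retrieval accuracy changes as the ratio of alpha to beta varies would show how much each term contributes.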