Video retrieval based on deep convolutional neural network

This paper proposes an end-to-end neural network for video retrieval. Given a query video, the network's embedding is used to retrieve similar videos from a database. The main idea is simple: a neural network is trained to classify and rank videos using a two-term loss function. The first term is a supervised classification term, a typical cross-entropy loss for multi-class classification in which a one-hot vector indicates the video class.

Supervised Classification Loss Term

The second term is a triplet loss term, commonly used for ranking pairs of images or videos. In triplet loss networks, a triplet (X, X+, X-) is fed into the network. The triplet loss term encourages a smaller distance between the embeddings of a positive pair and a larger distance between the embeddings of a negative pair. If X and X+ belong to the same class while X and X- belong to different classes, the network should satisfy |F(X)-F(X+)| < |F(X)-F(X-)|.

Video Ranking Loss Term
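
The ranking constraint above can be sketched as a hinge-style triplet loss. This is a minimal illustration, not the paper's exact formulation: the Euclidean distance, the `margin` parameter, and the function names are assumptions made for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive.

    `margin` is an assumed hyper-parameter; the review only states the
    inequality |F(X)-F(X+)| < |F(X)-F(X-)|.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

f_x = [0.0, 0.0]     # embedding F(X)
f_pos = [0.0, 1.0]   # embedding F(X+), distance 1 from F(X)
f_neg = [3.0, 4.0]   # embedding F(X-), distance 5 from F(X)

satisfied = triplet_loss(f_x, f_pos, f_neg)  # constraint holds, no penalty
violated = triplet_loss(f_x, f_neg, f_pos)   # roles swapped, penalty applies
```

When the constraint already holds with room to spare, the loss is zero and the triplet contributes no gradient, which is why triplet sampling matters in practice.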

Retrieving videos of similar actions is the main objective of this paper. Thus, clips X and X+ are sampled from the same action class, while X and X- are sampled from different actions. The classification loss term guides the network toward better action classification of clips, while the triplet loss term enforces similar embeddings within the same action class.

Two-Term Loss Function
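
A rough sketch of how the two terms might be combined is shown below. The weights `alpha` and `beta` are the hyper-parameters mentioned later in this review; their default values, the `margin`, and all function names are assumptions for illustration only.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; `label` is the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def triplet_term(d_pos, d_neg, margin=1.0):
    """Hinge on the gap between positive-pair and negative-pair distances."""
    return max(0.0, d_pos - d_neg + margin)

def two_term_loss(logits, label, d_pos, d_neg, alpha=1.0, beta=1.0, margin=1.0):
    """Weighted sum of the classification and ranking terms (weights assumed)."""
    return alpha * cross_entropy(logits, label) + beta * triplet_term(d_pos, d_neg, margin)

# Uniform logits over two classes give a classification term of log(2).
easy = two_term_loss([0.0, 0.0], label=0, d_pos=1.0, d_neg=5.0)
hard = two_term_loss([0.0, 0.0], label=0, d_pos=2.0, d_neg=1.0)
```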

To encode a clip of n frames, a Siamese deep convolutional neural network extracts a feature representation of each input frame. To keep the network simple, a video representation is created by fusing the frame features with a weighted average. The first fully connected layer is followed by a sigmoid layer designed to learn similarity-preserving binary-like codes; this layer acts like an n-bit hashing function. The second fully connected layer has k nodes, where k is the number of categories. Finally, the loss function combines the classification and triplet loss terms.

Network Architecture
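
The fusion and hashing steps can be sketched in a few lines. This is only an illustration under assumptions: the review does not say how the fusion weights are chosen, so they are passed in explicitly, and the tiny weight matrix stands in for a trained fully connected layer.

```python
import math

def fuse_frames(frame_features, weights):
    """Weighted average of per-frame feature vectors into one video vector."""
    total = sum(weights)
    dim = len(frame_features[0])
    return [
        sum(w * f[i] for w, f in zip(weights, frame_features)) / total
        for i in range(dim)
    ]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hash_layer(video_feature, weight_rows, biases):
    """Fully connected layer followed by a sigmoid: one binary-like value per bit."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, video_feature)) + b)
        for row, b in zip(weight_rows, biases)
    ]

frames = [[1.0, 2.0], [3.0, 4.0]]          # two frames, 2-d features (toy sizes)
fused = fuse_frames(frames, weights=[1.0, 3.0])
codes = hash_layer(fused, weight_rows=[[1.0, -1.0], [0.0, 0.0]], biases=[0.0, 0.0])
```

Each output lies in (0, 1); training pushes these values toward 0 or 1 so that thresholding them later loses little information.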

In the retrieval phase, binary hash codes are generated by binarization, which maps the binary-like outputs to 0 or 1. An exclusive-or operation is performed between the binary code of the query video and the codes of the videos stored in the database. This yields pairwise Hamming distances, and the videos with the smallest distances are returned as the most similar.
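
The retrieval step can be illustrated with plain integers as bit codes. This is a sketch under assumptions: the 0.5 threshold, the 4-bit codes, and the names `binarize`, `hamming`, and `retrieve` are chosen for the example and are not taken from the paper.

```python
def binarize(activations, threshold=0.5):
    """Pack binary-like sigmoid outputs into an integer bit code (threshold assumed)."""
    code = 0
    for a in activations:
        code = (code << 1) | (1 if a >= threshold else 0)
    return code

def hamming(a, b):
    """Hamming distance: count the set bits of the XOR of two codes."""
    return bin(a ^ b).count("1")

def retrieve(query_code, database, top_k=10):
    """Rank database entries by Hamming distance to the query, closest first."""
    ranked = sorted(database.items(), key=lambda item: hamming(query_code, item[1]))
    return [name for name, _ in ranked[:top_k]]

query = binarize([0.9, 0.1, 0.8, 0.2])                # bits 1010
database = {"a": 0b1010, "b": 0b1000, "c": 0b0101}    # hypothetical stored codes
nearest = retrieve(query, database, top_k=2)
```

Note that this is a linear scan: every stored code is XORed with the query, so the cost grows with the size of the database.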

The proposed approach is evaluated on HMDB51 and UCF101 using hash codes of different lengths (64, 128, 256, and 512 bits). As expected, more bits result in higher accuracy.

Performance comparison of different video retrieval algorithms on the UCF101 dataset. The table shows the mean Average Precision (mAP) of the top 10 results.


Temporal fusion of frame features is done using a weighted average. While trivial to implement, better approaches exist: 3D convolution and 3D pooling are simple and better capture the temporal relations between clip frames.

Optical flow, stacks of differences, and dynamic images are different modalities of an action clip. Since this paper targets video actions, these modalities should have been considered.

While the first fully connected layer acts like a hashing function, a pairwise XOR operation is still required between the query video and every video in the database. This can be expensive. Locality-sensitive hashing is a natural extension to reduce the computational cost of the retrieval phase.
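
To make the suggestion concrete, here is a minimal sketch of bit-sampling locality-sensitive hashing over binary codes. It is not part of the paper; the table count, the number of sampled bits, and every function name are assumptions for illustration.

```python
import random
from collections import defaultdict

def sample_bits(code, positions):
    """Bucket key: the values of a few selected bit positions of the code."""
    return tuple((code >> p) & 1 for p in positions)

def build_tables(database, n_bits, n_tables=4, bits_per_key=8, seed=0):
    """Index every code in several tables, each keyed on a random subset of bits."""
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        positions = rng.sample(range(n_bits), bits_per_key)
        buckets = defaultdict(list)
        for name, code in database.items():
            buckets[sample_bits(code, positions)].append(name)
        tables.append((positions, buckets))
    return tables

def candidates(query_code, tables):
    """Union of the buckets the query falls into; only these need an exact XOR."""
    found = set()
    for positions, buckets in tables:
        found.update(buckets.get(sample_bits(query_code, positions), []))
    return found

database = {"a": 0xFFFF, "b": 0x0000}   # hypothetical 16-bit codes
tables = build_tables(database, n_bits=16, n_tables=2, bits_per_key=4)
shortlist = candidates(0xFFFF, tables)
```

Codes that differ in only a few bits are likely to share a bucket in at least one table, so the exact Hamming ranking only has to run over the shortlist rather than the whole database. The trade-off is that a true neighbor can occasionally be missed.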

As mentioned previously, the loss function contains two terms: supervised classification plus triplet loss. These terms are weighted using two hyper-parameters, alpha and beta. It is a minor issue, but the paper omits how these hyper-parameters were selected and tuned.

I write reviews on computer vision papers. Writing tips are welcome.
