Localizing Moments in Video with Natural Language
This paper presents the Moment Context Network (MCN), which localizes moments in videos using text queries. The next figure illustrates the task: given a query such as “the little girl jumps back after falling”, the network should detect that particular moment within the video (highlighted in green). In this paper, text queries are arbitrary natural language sentences, yet all videos span only 25 or 30 seconds. As I understand it, the proposed formulation supports longer videos, but at a much higher computational cost.
The paper leverages a triplet loss to learn an embedding space in which a query text embedding is close to the embedding of the corresponding video moment. The paper makes two main contributions: (1) the proposed network integrates local and global video features with temporal endpoint features; (2) the authors collect the Distinct Describable Moments (DiDeMo) dataset, which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions.
This article focuses on the architecture and its technical details. But first, here is a brief overview of the DiDeMo dataset. It consists of over 10,000 unedited videos with 3–5 pairs of descriptions and distinct moments per video. This yields about 40,000 pairs of referring descriptions and localized moments in unedited videos, as shown in the next figure. DiDeMo is collected in an open-world setting and includes diverse content such as pets, concerts, and sports games. More details about the dataset annotation and verification process are available in the paper.
All videos are trimmed to 25 or 30 seconds, then divided into five or six 5-second clips. Each 5-second clip is represented by a visual feature descriptor extracted from the FC7 layer of a pretrained VGG network. The next figure shows a 30-second video split into six 5-second clips, each displayed as a GIF. Each clip is represented by a feature vector in R⁴⁰⁹⁶, so the whole video representation belongs to R^(6×4096). The moment of interest, highlighted in green, starts at the third GIF and ends at the fourth. To feed this moment into the proposed MCN, local features are constructed by pooling features within the moment (GIFs 3–4), and global features are constructed by averaging over all frames in the video (GIFs 1–6). The moment's temporal information is encoded using temporal endpoint features, which indicate the start and end of a candidate moment normalized to the interval [0, 1]. Concatenating the local features, the global features, and the two endpoint values gives the final representation of this moment fed into the network, which belongs to R⁸¹⁹⁴ (4096 + 4096 + 2), as shown below.
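The feature construction described above can be sketched in plain Python. This is only a minimal sketch under stated assumptions: mean pooling is used for the local feature (the text only says "pooling"), averaging the clip features stands in for averaging over all frames, and the endpoints are normalized by dividing clip boundaries by the number of clips. The function name `moment_feature` is hypothetical and does not come from the authors' code.

```python
def moment_feature(clip_features, start, end):
    """Build one moment descriptor from per-clip features.

    clip_features: list of per-clip vectors (e.g. six vectors of length 4096).
    start, end:    zero-based, inclusive clip indices of the candidate moment.
    """
    n_clips = len(clip_features)
    if not 0 <= start <= end < n_clips:
        raise ValueError("moment must lie inside the video")

    def mean_pool(vectors):
        # Element-wise average over a list of equal-length vectors.
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    # Local context: clips inside the moment (assumed mean pooling).
    local_feat = mean_pool(clip_features[start:end + 1])
    # Global context: all clips in the video (stand-in for all frames).
    global_feat = mean_pool(clip_features)
    # Assumed normalization: clip boundaries mapped to [0, 1].
    endpoints = [start / n_clips, (end + 1) / n_clips]
    return local_feat + global_feat + endpoints


# Toy 30-second video: six clips, each with a constant 4096-d feature.
video = [[float(i)] * 4096 for i in range(6)]
feature = moment_feature(video, start=2, end=3)  # third and fourth clips
print(len(feature))  # 4096 + 4096 + 2 = 8194
```

With real data, `video` would hold the FC7 descriptors of the clips rather than constant vectors.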
Query sentences are represented using dense 300-dimensional GloVe word embeddings, assuming the longest sentence contains 50 words. A sentence representation therefore belongs to R¹⁵⁰⁰⁰, i.e. 50 × 300.
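The fixed-size sentence representation can be illustrated as follows. This is a sketch under assumptions that the article does not state: whitespace tokenization, zero vectors for unknown words, and zero padding for short sentences. `encode_sentence` and `toy_glove` are hypothetical names, and the toy table merely stands in for real GloVe vectors.

```python
MAX_WORDS = 50   # longest sentence assumed in the text
EMBED_DIM = 300  # GloVe dimensionality


def encode_sentence(sentence, embeddings):
    """Map a sentence to a MAX_WORDS x EMBED_DIM list of word vectors."""
    zero = [0.0] * EMBED_DIM
    tokens = sentence.lower().split()[:MAX_WORDS]          # truncate long queries
    rows = [embeddings.get(tok, zero) for tok in tokens]   # unknown words -> zeros
    rows += [zero] * (MAX_WORDS - len(rows))               # pad short queries
    return rows


# Toy lookup table standing in for real GloVe vectors.
toy_glove = {"girl": [0.1] * EMBED_DIM, "jumps": [0.2] * EMBED_DIM}
encoded = encode_sentence("The little girl jumps back", toy_glove)
flat = [x for row in encoded for x in row]
print(len(encoded), len(encoded[0]), len(flat))  # 50 300 15000
```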
Triplet loss is used to learn an embedding in which a query sentence and its matching moment lie close to each other. A standard triplet loss formulation is employed as follows:
L = max(0, D(p_s, p_m) - D(p_s, n_m) + m)
where D(x, y) is the squared distance between a sentence embedding and a moment embedding, and m is a separation margin. (p_s, p_m) denotes a positive sentence-moment pair, while (p_s, n_m) denotes a negative pair. The DiDeMo dataset provides the positive sentence-moment pairs. Negative moments used during training can come either from different segments within the same video (intra-video negative moments) or from different videos (inter-video negative moments).
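A standard triplet loss of this kind can be written in a few lines. The sketch below assumes embeddings are plain lists of floats; the margin value 0.1 and the function names are illustrative choices, not values taken from the paper.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def triplet_loss(sentence, pos_moment, neg_moment, margin=0.1):
    """Hinge on D(sentence, positive) - D(sentence, negative) + margin."""
    return max(0.0, squared_distance(sentence, pos_moment)
               - squared_distance(sentence, neg_moment) + margin)


sentence = [0.0, 0.0]
positive = [0.1, 0.0]
far_negative = [1.0, 0.0]    # already well separated: no penalty
near_negative = [0.2, 0.0]   # too close to the sentence: penalized

print(triplet_loss(sentence, positive, far_negative))
print(triplet_loss(sentence, positive, near_negative))
```

The first call returns zero because the negative is already farther away than the positive by more than the margin; the second returns a positive loss of roughly 0.07.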
During training, inter-video negative moments are chosen to have the same start and end points as the positive moment. This encourages the model to differentiate between moments based on semantic content, as opposed to when the moment occurs in the video. In contrast, intra-video negative moments are taken from the same video as the positive moment but with different start and end points. This encourages the model to distinguish between subtle differences within a video. Localizing the moment of interest then requires more than just recognizing an object (the girl) or an action.
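The two sampling strategies can be sketched as below. This is a hypothetical illustration, not the authors' sampler: it assumes candidate moments are contiguous spans of clips, picks negatives uniformly at random, and uses made-up names (`all_moments`, `sample_negatives`).

```python
import random


def all_moments(n_clips):
    """Enumerate every contiguous (start, end) clip span, inclusive."""
    return [(s, e) for s in range(n_clips) for e in range(s, n_clips)]


def sample_negatives(videos, video_id, positive, rng=random):
    """Return (inter_negative, intra_negative) for one positive moment.

    videos:   dict mapping a video id to its number of clips.
    positive: (start, end) span of the annotated moment in `video_id`.
    """
    # Intra-video: same video, any span other than the positive one.
    candidates = [m for m in all_moments(videos[video_id]) if m != positive]
    intra_span = rng.choice(candidates)
    # Inter-video: a different video long enough to hold the same span.
    others = [v for v, n in videos.items() if v != video_id and n > positive[1]]
    inter_video = rng.choice(others)
    return (inter_video, positive), (video_id, intra_span)


videos = {"a": 6, "b": 6, "c": 5}
inter, intra = sample_negatives(videos, "a", (2, 3), random.Random(0))
print(len(all_moments(6)))  # 21 contiguous spans in a six-clip video
print(inter, intra)
```

The inter-video negative keeps the span `(2, 3)` but changes the video, while the intra-video negative keeps the video but changes the span.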
To summarize, an element in a training mini-batch has four parts: (1) a sentence; (2) a positive moment; (3) a negative inter-video moment; (4) a negative intra-video moment. The next figure shows these four parts and the MCN architecture that learns an embedding for each part.
Both positive and negative visual moments pass through two fully connected layers. For the text query, GloVe embeddings are computed first, then fed into an LSTM and finally a fully connected layer. Three different distances are computed using the MCN embeddings: (1) D(p_m, p_s), the distance between the positive moment and the sentence; (2) D(neg_inter, p_s), the distance between the negative inter-video moment and the sentence; (3) D(neg_intra, p_s), the distance between the negative intra-video moment and the sentence.
These three distances define the network loss function as follows:
L = λ · max(0, D(p_m, p_s) - D(neg_intra, p_s) + m) + (1 - λ) · max(0, D(p_m, p_s) - D(neg_inter, p_s) + m)
where λ is a hyperparameter that balances the intra-video and inter-video terms. This loss function encourages the sentence embedding to be closer to the positive moment than to both the inter-video and intra-video negative moments. The next figure presents promising qualitative results: the right moment is retrieved in the first two examples, while the last row shows an example where a wrong moment is retrieved.
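The combined loss built from the three distances can be sketched as follows. The weighting of the intra-video term by `lam` and the inter-video term by `1 - lam` follows my reading of the paper; the default values of `lam` and `margin` and the function names are assumptions made for illustration.

```python
def ranking_loss(pos_dist, neg_dist, margin=0.1):
    """Penalize a negative that is not farther than the positive by `margin`."""
    return max(0.0, pos_dist - neg_dist + margin)


def mcn_loss(d_pos, d_inter, d_intra, lam=0.5, margin=0.1):
    """Weighted sum of the intra-video and inter-video ranking terms.

    d_pos:   distance between the sentence and the positive moment.
    d_inter: distance between the sentence and the inter-video negative.
    d_intra: distance between the sentence and the intra-video negative.
    """
    intra_term = ranking_loss(d_pos, d_intra, margin)
    inter_term = ranking_loss(d_pos, d_inter, margin)
    return lam * intra_term + (1.0 - lam) * inter_term


# The inter-video negative is far away, the intra-video one is a near miss.
loss = mcn_loss(d_pos=0.2, d_inter=1.0, d_intra=0.25)
print(loss)
```

In this toy case only the intra-video term contributes, which matches the intuition that moments from the same video are the harder negatives.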
To quantitatively evaluate the MCN, three baselines are employed:
1- Moment Frequency Prior: Without looking at query text or video, just order the video moments according to annotation frequency. Short moments toward the beginning of videos are more frequent than long moments at the end of videos. This frequency prior is computed using human annotations. This is a very weak baseline.
2- Canonical correlation analysis (CCA): associates word embeddings with deep image representations using Fisher vectors. Given the MCN input visual features and the language features from the best MCN language encoder, Fisher vectors are employed to associate text with moments.
3- Natural Language Object Retrieval: Given an input image, a text query, and a set of candidate image bounding boxes, this method leverages a recurrent neural network (RNN) to score the candidate bounding boxes. For example, the query “white car on the right” should score high with a bounding box containing the white car on the right of the image. To employ this method for moment retrieval, candidate bounding boxes in each frame are scored with the object retrieval model. A frame is sampled every 10 frames. The score for each candidate moment is the average of the scores for the frames within the moment.
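Two of the baselines above reduce to simple scoring rules, sketched here in plain Python. Both functions are hypothetical illustrations: `frequency_prior_ranking` orders moments by how often they were annotated, and `moment_score` averages per-frame scores sampled every 10 frames, assuming a list with one score per frame is already available from some object retrieval model.

```python
from collections import Counter


def frequency_prior_ranking(annotated_moments):
    """Rank (start, end) spans by how often annotators chose them."""
    counts = Counter(annotated_moments)
    return [moment for moment, _ in counts.most_common()]


def moment_score(frame_scores, start_frame, end_frame, step=10):
    """Average the scores of frames sampled every `step` frames in a moment."""
    sampled = frame_scores[start_frame:end_frame + 1:step]
    return sum(sampled) / len(sampled)


# Toy annotations: the first clip is chosen most often.
annotations = [(0, 0), (0, 0), (1, 1), (0, 0), (1, 1), (2, 4)]
ranking = frequency_prior_ranking(annotations)
print(ranking)

# Toy per-frame scores: frame i has score i.
scores = [float(i) for i in range(100)]
print(moment_score(scores, 0, 49))  # frames 0, 10, 20, 30, 40
```

The frequency prior ignores both the query and the video, which is why it serves only as a weak reference point.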
The next table presents the quantitative evaluation. MCN is superior to the baselines by a large margin.
- MCN is one of the simplest architectures employed to address the text-to-video retrieval problem. The authors publish their Caffe implementation. Thus, the paper is a good starting point for those getting into this area.
- That being said, I felt some technical details are missing. Details like the video frame rate (frames per second, fps) and the frame sampling rate are important when working with videos in order to replicate the results (e.g. issue #1, issue #2). A VGG network is used to extract visual features, but which one? There are at least VGG-16 and VGG-19.
- The CCA baseline uses the MCN input visual features. Since this is the most competitive baseline, I wonder whether these visual features included the moment timestamp information (start and end). This is not stated in the paper.
- Finally, I think stronger baselines exist, since more complicated architectures solve a similar problem. But I am not sure about the timeline, since this paper was published in 2017.