Localizing Moments in Video with Natural Language

Given a text query, localize the corresponding moment in a video.
Example videos and annotations from our Distinct Describable Moments (DiDeMo) dataset.
Each five-second clip is represented by features extracted from the VGG FC7 layer. A moment feature vector is constructed from local moment features, global context features, and temporal endpoint information.
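
To make the moment feature concrete, here is a minimal sketch of how the three parts could be combined. It rests on assumptions the caption does not state: the local and global parts are mean-pooled over clip features, the endpoints are normalized to [0, 1] by the number of clips, and the function name `moment_feature` is hypothetical. It is an illustration, not the authors' Caffe code.

```python
import numpy as np


def moment_feature(clip_feats, start, end):
    """Build one moment feature from per-clip features (illustrative sketch).

    clip_feats: array of shape (num_clips, dim), one row per five-second clip.
    start, end: inclusive clip indices of the candidate moment.
    """
    num_clips = clip_feats.shape[0]
    if not 0 <= start <= end < num_clips:
        raise ValueError("start/end must satisfy 0 <= start <= end < num_clips")

    # Local part: average the clips inside the moment (assumed pooling).
    local = clip_feats[start:end + 1].mean(axis=0)
    # Global context: average over every clip in the video (assumed pooling).
    global_ctx = clip_feats.mean(axis=0)
    # Temporal endpoints: start and end normalized to [0, 1] (assumed scheme).
    endpoints = np.array(
        [start / num_clips, (end + 1) / num_clips], dtype=clip_feats.dtype
    )
    return np.concatenate([local, global_ctx, endpoints])


# Six five-second clips with 4096-d FC7-sized features (placeholder zeros).
clips = np.zeros((6, 4096), dtype=np.float32)
print(moment_feature(clips, 1, 2).shape)  # (8194,) = 4096 + 4096 + 2
```

The point of the sketch is that two moments with identical local content still get different vectors, because the endpoint entries differ. That is what lets a query such as "first" be tied to a position in the video.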
The sentence embedding is built from dense GloVe word embeddings.
Moment Context Network (MCN) Architecture
MCN Loss function
Natural language moment retrieval results on DiDeMo. Ground truth moments are outlined in yellow. The Moment Context Network (MCN) localizes diverse descriptions which include temporal indicators, such as “first” (top), and camera words, such as “camera zooms” (middle).
The Moment Context Network (MCN) outperforms baselines (rows 1–6) on our test set.
  • MCN is one of the simplest architectures proposed for the text-to-video retrieval problem. The authors published their Caffe implementation, so the paper is a good starting point for those getting into this area.
  • That said, I felt some technical details are missing. Details such as the video frame rate (fps) and the frame sampling rate are important for replicating results on video; see, e.g., issue #1 and issue #2. A VGG network is used to extract visual features, but which one? There are at least two variants, VGG-16 and VGG-19.
  • The CCA baseline uses the same visual input features as MCN. Since this is the most competitive baseline, I wonder whether these features include the moment timestamp information (start and end). The paper does not say.
  • Finally, I think stronger baselines exist: more complicated architectures solve similar problems. However, I am not sure about the timeline, since this paper was published in 2017.
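
To illustrate why the fps and sampling-rate details mentioned above matter, the small sketch below counts how many frames end up in one five-second clip under different settings. The specific frame rates and strides are hypothetical examples, not values from the paper, and `frames_per_clip` is a helper I made up for illustration.

```python
def frames_per_clip(fps, clip_seconds=5.0, stride=1):
    """Count the frames sampled from one clip when keeping every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    total_frames = int(round(fps * clip_seconds))
    # Sampled frame indices are 0, stride, 2 * stride, ... below total_frames.
    return len(range(0, total_frames, stride))


# Hypothetical settings: the same clip yields very different frame counts.
for fps, stride in [(25, 1), (30, 1), (25, 6), (30, 6)]:
    print(fps, stride, frames_per_clip(fps, stride=stride))
```

Depending on the setting, the pooled clip feature averages anywhere from about twenty to 150 frames, so two replications that differ only in these unstated choices can produce different features.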



Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcomed.