Less is More: Learning Highlight Detection from Video Duration

Video frames from three shorter user-generated video clips (top row) and one longer user-generated video (second row). Although all recordings capture the same event (surfing), video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about their content. The height of the red curve indicates the highlight score over time. We leverage this natural phenomenon as a free latent supervision signal in large-scale Web video.
The model learns a scoring function f through a margin ranking loss, L(D) = Σ_{(i,j) ∈ P} max(0, 1 − f(x_i) + f(x_j)), where x_i and x_j are the feature representations of video segments s_i and s_j, respectively. s_i and s_j are sampled from a short (<15-second) video and a long (>45-second) video, respectively.
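As a minimal sketch (not the authors' implementation), the hinge term for a single pair can be written in plain Python; `ranking_loss` and the scalar score arguments are hypothetical names standing in for f(x_i) and f(x_j):

```python
def ranking_loss(f_xi, f_xj, margin=1.0):
    """Hinge ranking loss for one pair: encourages f(x_i) > f(x_j) + margin,
    i.e., the short-video segment should outscore the long-video segment."""
    return max(0.0, margin - f_xi + f_xj)

# If the short-video segment already wins by the margin, the loss is zero.
print(ranking_loss(2.0, 0.5))  # 0.0
# If the ranking is inverted, the loss grows with the violation.
print(ranking_loss(0.5, 2.0))  # 2.5
```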
Durations for the 10M Instagram training videos
The noise-tolerant variant weights each pair, L(D) = Σ_{(i,j) ∈ P} w_{ij} · max(0, 1 − f(x_i) + f(x_j)), subject to Σ_{(i,j) ∈ P} w_{ij} = p · |P|, where x_i and x_j are the feature representations of video segments s_i and s_j, respectively. |P| is the total number of video pairs, and p is the anticipated proportion of pairs that are valid; for example, p = 0.8 indicates that 80% of the video pairs are expected to be valid. w_{ij} is learned by a separate neural network that quantifies the validity of a pair of videos, so a noisy pair should receive a small w_{ij}. This mitigates the impact of noisy pairs on the loss L.
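A minimal plain-Python sketch of such a weighted hinge loss; `weighted_ranking_loss` and its arguments are hypothetical names, not the authors' code, and averaging over pairs is a design choice of this sketch:

```python
def weighted_ranking_loss(pairs, weights, margin=1.0):
    """Weighted hinge ranking loss: w_ij near 0 suppresses a noisy pair's term.
    pairs:   list of (f(x_i), f(x_j)) score tuples
    weights: list of validity weights w_ij in [0, 1]"""
    total = sum(w * max(0.0, margin - fi + fj)
                for (fi, fj), w in zip(pairs, weights))
    return total / len(pairs)

pairs = [(2.0, 0.5), (0.5, 2.0)]          # second pair violates the ranking
print(weighted_ranking_loss(pairs, [1.0, 1.0]))  # 1.25
print(weighted_ranking_loss(pairs, [1.0, 0.0]))  # 0.0 (noisy pair masked out)
```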
Highlight detection results (mAP) on YouTube Highlights. Our method outperforms all the baselines, including the supervised ranking-based methods [33, 9].
Highlight detection results (Top-5 mAP score) on TVSum. All methods listed are unsupervised. Our method outperforms all the baselines by a large margin. Entries with “-” mean per-class results not available for that method.
  • The paper is well-written and I recommend it to people interested in ranking and self-supervised learning.
  • “Less is more” is an interesting idea that has been used in 2D images for crowd counting. I wonder if the same idea has more applications in 2D and 3D (medical) images.
  • To handle noisy data, the authors use a latent variable learned through a separate neural network, h(x_i, x_j). The network h quantifies uncertainty given a pair of video segments (x_i, x_j). However, I expected a Bayesian approach like [2]. First, the uncertainty should be learned by the original network f(x) as an extra output dimension, not by a separate network h(x_i, x_j); the extra network increases the complexity and computational cost of the proposed approach. Second, h(x_i, x_j) computes uncertainty for a pair of video segments, which seems imperfect because every segment should have its own uncertainty.
  • That being said, the current way of handling noisy data has one merit: it enables integrating a manual prior into the loss function. For example, training with p = 0.8 tells the system that about 80% of the pairs are a priori expected to be valid. Of course, the paper treats p as a hyperparameter, since there is no principled way to quantify this prior.
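One simple way to impose such a prior, sketched here under my own assumptions rather than taken from the paper, is to rescale the learned validity weights so that their mean matches p; `apply_validity_prior` and `raw_weights` are hypothetical names:

```python
def apply_validity_prior(raw_weights, p=0.8):
    """Rescale learned validity weights so their mean approximates the prior p.
    raw_weights: hypothetical outputs of the weighting network, each in [0, 1].
    Clamping to 1.0 keeps weights valid but can leave the mean slightly below p."""
    mean_w = sum(raw_weights) / len(raw_weights)
    scale = p / mean_w if mean_w > 0 else 0.0
    return [min(1.0, w * scale) for w in raw_weights]

# Raw weights averaging 0.53 are pushed toward the prior mean of 0.8.
ws = apply_validity_prior([0.9, 0.5, 0.2], p=0.8)
```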



Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcomed.