Less is More: Learning Highlight Detection from Video Duration

The goal of video highlight detection is to retrieve a moment, in the form of a short video clip, that captures a user’s primary attention or interest within an unedited video, as shown in the next figure. An efficient highlight detection approach improves the video browsing experience, enhances social video sharing, and facilitates video recommendation. Supervised highlight detection approaches require a dataset of unedited videos paired with manually annotated highlights, i.e., video-highlight pairs. These datasets are very expensive to collect and annotate.

Video frames from three shorter user-generated video clips (top row) and one longer user-generated video (second row). Although all recordings capture the same event (surfing), video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about their content. The height of the red curve indicates the highlight score over time. We leverage this natural phenomenon as a free latent supervision signal in large-scale Web video.

This paper [1] avoids the expensive supervision entailed by collecting video-highlight pairs. The authors propose a framework to learn highlight detection from a large collection of unlabeled videos. The main idea is that “shorter user-uploaded videos tend to have a key focal point as the user is more selective about the content, whereas longer ones may not have every second be as crisp or engaging”. Conveying a message in a short video takes more effort: the user must either film only the significant moments or manually cut the footage down to them afterwards. Thus, video duration provides a free latent signal for training a highlight detection network.

Concretely, given two video segments (s_i, s_j) extracted from a short and a long video, respectively, a network f is trained to assign a higher highlight score to s_i than to s_j. This is achieved with a pairwise margin (hinge) ranking loss:

$$L(D) = \sum_{(s_i, s_j) \in P} \max\bigl(0,\; 1 - f(x_i) + f(x_j)\bigr)$$

where x_i and x_j are the feature representations of video segments s_i and s_j, respectively, and P is the set of (short, long) segment pairs built from the unlabeled dataset D. s_i is sampled from a short video (<15 seconds) and s_j from a long video (>45 seconds).
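The ranking objective described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the helper names `duration_bucket` and `pairwise_ranking_loss`, the margin of 1, and the example scores are assumptions made for this sketch, and the scores stand in for the outputs f(x) of a real network.

```python
SHORT_MAX_SECONDS = 15  # videos shorter than this count as "short"
LONG_MIN_SECONDS = 45   # videos longer than this count as "long"


def duration_bucket(duration_seconds):
    """Return "short", "long", or None for durations in between.

    In-between videos are simply left out of the pairs in this sketch.
    """
    if duration_seconds < SHORT_MAX_SECONDS:
        return "short"
    if duration_seconds > LONG_MIN_SECONDS:
        return "long"
    return None


def pairwise_ranking_loss(short_scores, long_scores, margin=1.0):
    """Sum of hinge terms max(0, margin - f(x_i) + f(x_j)) over all pairs.

    short_scores[k] and long_scores[k] are the scores of the k-th
    (short-video segment, long-video segment) pair.
    """
    if len(short_scores) != len(long_scores):
        raise ValueError("each short-video score needs a long-video partner")
    return sum(
        max(0.0, margin - f_i + f_j)
        for f_i, f_j in zip(short_scores, long_scores)
    )


# Hypothetical highlight scores for three (short, long) pairs.
short_scores = [2.0, 0.5, 1.0]
long_scores = [0.0, 0.4, 1.5]
loss = pairwise_ranking_loss(short_scores, long_scores)
print(loss)
```

Only the first pair is ranked correctly by more than the margin, so it contributes nothing; the other two pairs are penalized in proportion to how far they fall short of it.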

To train the neural network f(x), an unlabeled dataset of 10M hashtagged videos is collected from Instagram using hashtag queries. This article omits the dataset collection details, but the following figure shows the distribution of video durations.

Durations for the 10M Instagram training videos

The “less is more” ranking formulation assumes clean data. The 10M Instagram dataset violates this assumption because it is crawled without any human supervision. To mitigate the impact of noisy videos, the vanilla ranking formulation is modified to include a weight w_{ij} for each pair. This weight quantifies the validity of a pair of videos: it should be large for valid pairs and small for noisy ones. It is learned during training by another neural network h(x_i, x_j). Thus, the final formulation becomes

$$L(D) = \sum_{(s_i, s_j) \in P} w_{ij}\,\max\bigl(0,\; 1 - f(x_i) + f(x_j)\bigr)$$

$$\text{s.t.}\quad \sum_{(s_i, s_j) \in P} w_{ij} = p\,|P|,\qquad w_{ij} \in [0, 1],\qquad w_{ij} = h(x_i, x_j)$$

where x_i and x_j are the feature representations of video segments s_i and s_j, respectively, |P| is the total number of video pairs, and p is the anticipated proportion of pairs that are valid. For example, p = 0.8 indicates that 80% of the video pairs are expected to be valid. Because a noisy pair should receive a small w_{ij}, its contribution to the loss L is reduced.

It is not trivial to learn w_{ij} under these constraints. As a workaround, the paper divides each training mini-batch into groups and applies a softmax over the w_{ij} within each group. I elaborate more on this noise mitigation approach at the end of this article.
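The group-wise softmax workaround can be illustrated with a small sketch. It rests on several assumptions: `validity_logits` stands in for the raw outputs of h(x_i, x_j), the function names and the fixed `group_size` are hypothetical, and the way the paper ties the grouping to the prior p is not covered here. The softmax makes the weights in each group non-negative and sum to 1, so a pair that h considers noisy gets a small weight relative to the others in its group.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def group_weighted_ranking_loss(short_scores, long_scores, validity_logits,
                                group_size, margin=1.0):
    """Hinge ranking loss where each pair is weighted by a per-group softmax.

    validity_logits[k] is an assumed raw validity score for pair k; a
    softmax within each group of `group_size` pairs turns it into w_ij.
    """
    n = len(short_scores)
    if not (n == len(long_scores) == len(validity_logits)):
        raise ValueError("all inputs must describe the same pairs")
    if group_size <= 0 or n % group_size != 0:
        raise ValueError("batch size must be a multiple of group_size")

    total = 0.0
    for start in range(0, n, group_size):
        stop = start + group_size
        weights = softmax(validity_logits[start:stop])
        for w, f_i, f_j in zip(weights,
                               short_scores[start:stop],
                               long_scores[start:stop]):
            total += w * max(0.0, margin - f_i + f_j)
    return total


# Hypothetical mini-batch of four pairs split into two groups of two.
short_scores = [2.0, 0.5, 1.0, 1.0]
long_scores = [0.0, 0.4, 1.5, 1.0]

# Equal logits: every pair in a group gets weight 0.5.
uniform = group_weighted_ranking_loss(short_scores, long_scores,
                                      [0.0, 0.0, 0.0, 0.0], group_size=2)

# A very low logit for the second pair nearly removes it from the loss.
downweighted = group_weighted_ranking_loss(short_scores, long_scores,
                                           [0.0, -50.0, 0.0, 0.0], group_size=2)
print(uniform, downweighted)
```

In a real training loop both f and h would be differentiable networks trained jointly; this sketch only shows how the weights enter the loss.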

The proposal is evaluated quantitatively on two datasets: (1) YouTube Highlights and (2) TVSum. The next figures compare it against both supervised and unsupervised baselines.

Highlight detection results (mAP) on YouTube Highlights. Our method outperforms all the baselines, including the supervised ranking-based methods [33, 9].

The proposed “less is more” formulation has two variants: domain-agnostic and domain-specific. Ours-A is the domain-agnostic variant, where the training videos from all queried tags are aggregated, so a single model is trained and evaluated for all experiments. Ours-S is the domain-specific variant, where a separate model is trained for each queried tag. For example, one model is trained on videos crawled with the “dog” tag and another on videos crawled with the “skiing” tag. Because each model specializes in one domain, the domain-specific variant is more accurate. However, this performance comes at the expense of more models, one per tag.

Highlight detection results (Top-5 mAP score) on TVSum. All methods listed are unsupervised. Our method outperforms all the baselines by a large margin. Entries with “-” mean per-class results not available for that method.

My Comments:

  • The paper is well-written and I recommend it to people interested in ranking and self-supervised learning.
  • “Less is more” is an interesting idea that has been used in 2D images for crowd counting. I wonder if the same idea has more applications in 2D and 3D (medical) images.
  • To handle noisy data, the authors used a latent variable learned through a separate neural network — h(x_i,x_j). The neural network h quantifies uncertainty given a pair of video segments (x_i,x_j). However, I expected a Bayesian approach like [2]. First, the uncertainty should be learned using the original neural network f(x) as a separate extra dimension — and not using a separate network h(x_i,x_j). The extra network h(x_i,x_j) increases the complexity and computational cost of the proposal. Second, h(x_i,x_j) computes uncertainty given a pair of video segments. This sounds imperfect because every segment should have its own uncertainty.
  • That being said, the current way of handling noisy data has one merit: it lets a manual prior be integrated into the loss function. For example, training with p = 0.8 tells the system that about 80% of the pairs are a priori expected to be valid. Yet, of course, the paper treats p as a hyperparameter because there is no way to quantify this prior.

[1] Less is More: Learning Highlight Detection from Video Duration

[2] What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?

I write reviews of computer vision papers. Writing tips are welcome.
