Less is More: Learning Highlight Detection from Video Duration
The goal in video highlight detection is to retrieve a moment, in the form of a short video clip, that captures a user's primary attention or interest within an unedited video, as shown in the next Figure. An efficient highlight detection approach improves the video browsing experience, enhances social video sharing, and facilitates video recommendation. Supervised highlight detection approaches require a dataset of unedited videos with their corresponding manually annotated highlights, i.e., video-highlight pairs. Such datasets are very expensive to collect and annotate.
This paper [1] avoids the expensive supervision entailed by collecting video-highlight pairs. The authors propose a framework to learn highlight detection from a large collection of unlabeled videos. The main idea is that “shorter user-uploaded videos tend to have a key focal point as the user is more selective about the content, whereas longer ones may not have every second be as crisp or engaging”. To convey a message in a short video, the uploader has to make the effort either to film only the significant moments or to manually edit out the rest later. Thus, a video's duration provides a free latent signal to train a neural network for highlight detection.
Concretely, given two video segments (s_i, s_j) extracted from a short and a long video, respectively, a network f is trained to predict a higher highlight score for s_i than for s_j. This is achieved using a ranking loss, i.e., a triplet-style hinge loss.
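Paraphrasing the paper's objective (my reconstruction, with D denoting the set of (short, long) segment pairs and the usual unit margin), it reads:

$$L(\mathcal{D}) = \sum_{(s_i,\, s_j) \in \mathcal{D}} \max\bigl(0,\ 1 - f(s_i) + f(s_j)\bigr)$$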
To train the neural network f(x), an unlabeled dataset of roughly 10M hashtagged Instagram videos is collected using hashtag queries. This article omits the dataset collection details, but the following Figure shows the distribution of video durations.
The “less is more” ranking formulation assumes the data is noise-free. The 10M Instagram dataset violates this assumption because it is crawled without any human supervision. To mitigate the impact of noisy videos, the vanilla ranking formulation is modified to include a weight parameter w that quantifies the validity of a pair of segments: w should be large for valid pairs and small for noisy ones. The weights are learned during training using another neural network h(x_i, x_j). Thus, the final formulation becomes a weighted version of the ranking loss.
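In my paraphrase, with w_{ij} = h(x_i, x_j) and p denoting the anticipated proportion of valid pairs, it reads roughly:

$$L(\mathcal{D}) = \sum_{(s_i,\, s_j) \in \mathcal{D}} w_{ij}\, \max\bigl(0,\ 1 - f(s_i) + f(s_j)\bigr) \quad \text{s.t.} \quad \sum_{(i,j)} w_{ij} \ge p\,|\mathcal{D}|, \;\; w_{ij} \in [0, 1]$$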
It is not trivial to learn w_{ij} under these constraints. Thus, the paper proposes a workaround: dividing the training mini-batch into groups and applying a softmax over the w_{ij} within each group. I elaborate on this noise mitigation approach at the end of this article.
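To make the mechanics concrete, here is a minimal PyTorch sketch of such a group-softmax weighted ranking loss. This is my own illustration under stated assumptions, not the authors' code: HighlightScorer, PairValidityNet, and group_size are hypothetical names, and the exact mapping between the validity prior p and the group size is glossed over.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighlightScorer(nn.Module):
    """f(x): maps a segment feature vector to a scalar highlight score."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.mlp(x).squeeze(-1)

class PairValidityNet(nn.Module):
    """h(x_i, x_j): maps a (short, long) segment pair to an unnormalized validity logit."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x_i, x_j):
        return self.mlp(torch.cat([x_i, x_j], dim=-1)).squeeze(-1)

def weighted_ranking_loss(f, h, x_short, x_long, group_size=4, margin=1.0):
    """Hinge ranking loss where each pair's contribution is softly gated by h.

    x_short, x_long: (N, feat_dim) features of segments from short / long videos.
    The N pairs are split into groups of `group_size`; a softmax over the
    validity logits within each group downweights likely-noisy pairs.
    """
    hinge = F.relu(margin - f(x_short) + f(x_long))   # per-pair hinge loss, shape (N,)
    logits = h(x_short, x_long)                        # per-pair validity logits, shape (N,)

    n_groups = hinge.shape[0] // group_size
    hinge = hinge[: n_groups * group_size].view(n_groups, group_size)
    logits = logits[: n_groups * group_size].view(n_groups, group_size)
    w = F.softmax(logits, dim=1)                       # weights sum to 1 within each group
    return (w * hinge).sum(dim=1).mean()
```

Because the weights are normalized per group, only a fraction of each group's pairs effectively drives the gradient; a larger group size corresponds to a stronger assumption that many crawled pairs are noisy, which is, loosely, where the prior p enters.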
The proposal is evaluated quantitatively on two datasets: (1) YouTube Highlights and (2) TVSum. The next Figures present a comparison against both supervised and unsupervised baselines.
The proposed “less is more” formulation has two variants: domain-agnostic (Ours-A) and domain-specific (Ours-S). In the domain-agnostic variant, the training videos from all queried tags are aggregated, so a single model is trained and evaluated across all experiments. In the domain-specific variant, a separate model is trained for each queried tag: for example, one model on videos crawled with the “dog” tag and another on videos crawled with the “skiing” tag. Because each model specializes in a single domain, the domain-specific variant is more accurate. However, this performance comes at the expense of more models, one per tag.
My Comments:
- The paper is well-written and I recommend it to people interested in ranking and self-supervised learning.
- “Less is more” is an interesting idea that has been used in 2D images for crowd counting. I wonder if the same idea has more applications in 2D and 3D (medical) images.
- To handle noisy data, the authors use a latent variable learned through a separate neural network, h(x_i, x_j), which quantifies uncertainty given a pair of video segments (x_i, x_j). However, I expected a Bayesian approach like [2]. First, the uncertainty should be learned by the original network f(x) as an extra output dimension, not by a separate network h(x_i, x_j); the extra network increases the complexity and computational cost of the proposal. Second, h(x_i, x_j) computes uncertainty for a pair of video segments, which seems suboptimal because each segment should have its own uncertainty. A sketch of what I have in mind follows after these comments.
- That being said, the current way of handling noisy data has one merit: it enables integrating a manual prior into the loss function. For example, training with p = 0.8 tells the system that about 80% of the pairs are a priori expected to be valid. Of course, the paper treats p as a hyperparameter because there is no principled way to quantify this prior.
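For concreteness, here is a minimal sketch of the alternative I am suggesting: my own adaptation of the loss-attenuation idea from [2] to the ranking setting, not something taken from either paper. The network f predicts a per-segment log-variance alongside the score, the pairwise hinge is attenuated by the combined uncertainty, and the log-variance term discourages the degenerate solution of declaring every segment uncertain.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighlightScorerWithUncertainty(nn.Module):
    """f(x) returns a highlight score and a per-segment log-variance (aleatoric uncertainty)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.score_head = nn.Linear(256, 1)
        self.logvar_head = nn.Linear(256, 1)

    def forward(self, x):
        z = self.backbone(x)
        return self.score_head(z).squeeze(-1), self.logvar_head(z).squeeze(-1)

def uncertainty_attenuated_ranking_loss(model, x_short, x_long, margin=1.0):
    """Hinge ranking loss attenuated by the predicted per-segment uncertainties."""
    s_i, logvar_i = model(x_short)
    s_j, logvar_j = model(x_long)
    hinge = F.relu(margin - s_i + s_j)
    logvar = logvar_i + logvar_j          # combined uncertainty of the pair
    # exp(-logvar) downweights pairs the model deems noisy; the +logvar term
    # penalizes predicting large uncertainty for every segment.
    return (torch.exp(-logvar) * hinge + logvar).mean()
```

Here the uncertainty is attached to individual segments and reused across every pair a segment appears in, which is exactly the property the pair-level h(x_i, x_j) lacks, and it comes almost for free from f without a second network.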
[1] Less is More: Learning Highlight Detection from Video Duration
[2] What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?