Metric learning literature assumes binary labels, where samples belong to either the same or different classes. While this binary perspective has motivated fundamental ranking losses (e.g., Contrastive and Triplet loss), it has reached a stagnant point [2]. Thus, one novel direction for metric learning is continuous (non-binary) similarity. This paper [1] promotes metric learning beyond binary supervision as shown in the next Figure.

(a) Existing methods categorize neighbors into positive and negative classes, and learn a metric space where positive images are close to the anchor and negative ones far apart. In such a space, the distance between a pair of images is not necessarily related to their semantic similarity, since the order and degrees of similarity between them are disregarded. (b) This paper [1] preserves distance ratios from the label space in the learned metric space to overcome the aforementioned limitation.

Binary metric learning is insufficient for objects with continuous similarity criteria, such as image captions, human poses, and scene graphs. Thus, this paper [1] proposes a triplet-loss variant, dubbed the log-ratio loss, that…
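The post cuts off here, but the loss itself is compact. Below is a minimal PyTorch sketch of a log-ratio loss, assuming squared Euclidean distances in the embedding space and precomputed continuous distances (e.g., between human poses) in the label space; all names are mine, not the paper's code.

```python
import torch

def log_ratio_loss(anchor, nbr_i, nbr_j, label_dist_i, label_dist_j, eps=1e-8):
    """Sketch: match distance ratios in the embedding space to distance
    ratios in the continuous label space (no binary positive/negative split).

    anchor, nbr_i, nbr_j: embeddings of shape (batch, dim).
    label_dist_i, label_dist_j: label-space distances of shape (batch,).
    """
    # Squared Euclidean distances in the learned embedding space.
    d_i = (anchor - nbr_i).pow(2).sum(dim=1)
    d_j = (anchor - nbr_j).pow(2).sum(dim=1)

    # The ratio of embedding distances should match the ratio of label
    # distances; working in log space turns the ratio into a difference.
    log_ratio_embed = torch.log(d_i + eps) - torch.log(d_j + eps)
    log_ratio_label = torch.log(label_dist_i + eps) - torch.log(label_dist_j + eps)

    return (log_ratio_embed - log_ratio_label).pow(2).mean()
```

Note that, unlike the triplet loss, nothing here depends on which neighbor is "positive"; only the degree of similarity matters.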


This paper [1] quantifies the financial and environmental costs (CO2 emissions) of training a deep network. It also draws attention to the inequality between academia and industry in terms of computational resources. The paper uses NLP architectures to present its case. Yet, the discussed issues are very relevant to the computer vision community.

The paper compares the amount of CO2 emitted by familiar consumption benchmarks (e.g., a car's lifetime emissions) versus common NLP models (e.g., a transformer). Table 1 shows that training a transformer network (with neural architecture search) emits significantly more CO2 than a fuel car does over its entire lifetime.

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.

Then, the paper compares both the…


This paper [1] proposes an unsupervised framework for hard training-example mining. The proposed framework has two phases. Given a collection of unlabelled images, the first phase identifies positive and negative image pairs. Then, the second phase leverages these pairs to fine-tune a pretrained network.

Phase #1:

The first phase leverages a pretrained network to project the unlabelled images into an embedding space (manifold) as shown in Fig.1.

Figure 1: A pretrained network embeds images into a manifold (feature space)

The manifold is used to create pairs/triplets from the unlabelled images. For an anchor image, the manifold provides two types of nearest neighbors: Euclidean (NN^e) and manifold (NN^m), as shown in Fig.2. The…
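The post is truncated here, but the two neighbor types can be sketched. Euclidean neighbors come from raw pairwise distances, while manifold neighbors follow the structure of the data manifold; the paper uses a similarity-graph process for the latter, and the sketch below stands in with kNN-graph shortest paths (a simplification on my part; all names are mine).

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def euclidean_and_manifold_nn(embeddings, anchor_idx, k=5, graph_k=10):
    """Sketch: two notions of 'nearest neighbor' on the same embeddings.

    NN^e: closest points by straight-line (Euclidean) distance.
    NN^m: closest points by distance along a kNN graph, a common
    approximation of distance on the data manifold.
    """
    # Euclidean neighbors: raw pairwise distances to the anchor.
    diff = embeddings - embeddings[anchor_idx]
    euc_dist = np.linalg.norm(diff, axis=1)
    nn_e = np.argsort(euc_dist)[1:k + 1]  # skip the anchor itself

    # Manifold neighbors: shortest-path distances over a kNN graph.
    graph = kneighbors_graph(embeddings, graph_k, mode='distance')
    geo_dist = shortest_path(graph, directed=False, indices=[anchor_idx])[0]
    nn_m = np.argsort(geo_dist)[1:k + 1]

    return nn_e, nn_m
```

Samples that are close in NN^m but far in NN^e (and vice versa) are exactly the informative, hard examples this mining strategy is after.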


This paper [1] proposes a tool, L2-CAF, to visualize attention in convolutional neural networks. L2-CAF is a generic visualization tool that can do everything CAM [3] and Grad-CAM [2] can do, but the opposite is not true.

Given a pre-trained CNN, an input x generates an output NT(x); this is the solid green path in the next Figure. For the same input x, if the last convolutional layer's output is multiplied by a constrained attention filter f, the network generates another output FT(x, f); this is the dashed orange path. The filter f is randomly initialized, then optimized using…
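The post is truncated, but the optimization is simple to sketch: the network weights stay frozen, and only the unit-norm filter f is optimized so that the filtered output FT(x, f) matches the unconstrained output NT(x). A minimal PyTorch sketch, assuming the network is split into a conv backbone and a head (my split and names, not the paper's API):

```python
import torch
import torch.nn.functional as F

def l2_caf(backbone, head, x, steps=100, lr=0.1):
    """Sketch of the L2-CAF idea: find a unit-norm spatial attention
    filter f such that masking the last conv features barely changes
    the network output. The network itself is never modified."""
    with torch.no_grad():
        feat = backbone(x)          # (1, C, H, W) last conv features
        nt_out = head(feat)         # solid path: unconstrained output NT(x)

    h, w = feat.shape[2], feat.shape[3]
    f = torch.randn(1, 1, h, w, requires_grad=True)   # random init
    opt = torch.optim.Adam([f], lr=lr)

    for _ in range(steps):
        f_unit = f / f.norm()       # enforce the unit L2-norm constraint
        ft_out = head(feat * f_unit)  # dashed path: filtered output FT(x, f)
        loss = F.mse_loss(ft_out, nt_out)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return (f / f.norm()).detach().squeeze()  # attention heatmap
```

Because only f is optimized, the tool works on any pre-trained CNN without retraining or architectural changes.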


Metric learning learns a feature embedding that quantifies the similarity between objects and enables retrieval. Metric learning losses can be categorized into two classes: pair-based and proxy-based. The next figure highlights the difference between the two classes. Pair-based losses pull similar samples together while pushing different samples apart (data-to-data relations). Proxy-based losses compute class representative(s) during training. Then, samples are pulled towards their class representatives and pushed away from different representatives (data-to-proxy relations).

Proxy-based losses compute class representatives (stars) for each class. Data samples (circles) are pulled towards their own class representatives and pushed away from the others (data-to-proxy). Pair-based losses pull similar data samples together and push different data samples apart (data-to-data). The solid green lines indicate a pull force, while the red dashed lines indicate a push force.
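To make the two families concrete, here is a hedged sketch of one loss from each: a Proxy-NCA-style proxy loss and a contrastive pair loss (both simplified; names are mine):

```python
import torch
import torch.nn.functional as F

class ProxyNCA(torch.nn.Module):
    """Sketch of a proxy-based loss (Proxy-NCA style): one learnable
    proxy per class; samples interact only with proxies (data-to-proxy)."""

    def __init__(self, num_classes, dim):
        super().__init__()
        self.proxies = torch.nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, embeddings, labels):
        e = F.normalize(embeddings, dim=1)
        p = F.normalize(self.proxies, dim=1)
        dist = torch.cdist(e, p).pow(2)       # (batch, num_classes)
        # Minimize distance to the own-class proxy relative to all others.
        return F.cross_entropy(-dist, labels)

def contrastive_pair_loss(emb_a, emb_b, same_class, margin=0.5):
    """Sketch of a pair-based loss: pull positive pairs together and
    push negative pairs beyond a margin (data-to-data relations)."""
    d = F.pairwise_distance(emb_a, emb_b)
    pos = same_class * d.pow(2)
    neg = (1 - same_class) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()
```

The proxy loss touches each sample once per batch (fast, but coarse), while the pair loss must enumerate pairs (rich relations, but many more terms), which previews the trade-offs summarized next.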

The next table summarizes the pros and cons of both proxy-based and pair-based losses. For instance, pair-based losses leverage fine-grained semantic relations between samples but suffer from slow…


This (G)old paper [1] tackles an interesting question: Why Does Unsupervised Pre-training Help Deep Learning? The authors support their conclusions with a ton of experiments. Yet, the findings contradict a common belief about unsupervised learning. That's why I have conflicting feelings about this paper. I will present the paper first, then follow up with my comments.

The authors strive to understand how unsupervised pretraining helps. There are two main hypotheses:

  1. Better optimization: Unsupervised pretraining puts the network in a region of parameter space where basins of attraction run deeper than when starting with random parameters. In simple words, the network…

This paper proposes PixelPlayer, a system to ground audio inside a video (frames) without manual supervision. Given an input video, PixelPlayer separates the accompanying audio into components and spatially localizes these components in the video. PixelPlayer enables us to listen to the sound originating from each pixel in the video as shown in the next Figure.

“Which pixels are making sounds?” Energy distribution of sound in pixel space. Overlaid heatmaps show the volumes from each pixel.

To train PixelPlayer using a neural network, a dataset is needed. The authors introduce a musical instrument video dataset for the proposed task, called the MUSIC (Multimodal Sources of Instrument Combinations) dataset. This dataset is crawled from YouTube with no manual annotation. MUSIC dataset…
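The dataset description is truncated here. On the training side, PixelPlayer's self-supervision comes from the mix-and-separate idea: mix the audio of two videos and train the network to recover each one. A minimal sketch, assuming magnitude spectrograms and an assumed audio_net(mix, vision_feat) -> mask interface (the real model splits this into an audio U-Net, a video analysis network, and a synthesizer; all names here are mine):

```python
import torch.nn.functional as F

def mix_and_separate_loss(audio_net, vision_feat_1, vision_feat_2,
                          spec_1, spec_2, eps=1e-8):
    """Sketch of mix-and-separate self-supervision: mix the magnitude
    spectrograms of two videos, then predict a mask per video that
    recovers its own spectrogram from the mixture."""
    mix = spec_1 + spec_2                       # synthetic mixture

    # Predict a separation mask for each video, conditioned on its frames.
    mask_1 = audio_net(mix, vision_feat_1)
    mask_2 = audio_net(mix, vision_feat_2)

    # Self-supervised targets: ratio masks from the known components
    # (the paper also explores binary masks; ratio masks keep this short).
    target_1 = spec_1 / (mix + eps)
    target_2 = spec_2 / (mix + eps)

    return F.l1_loss(mask_1, target_1) + F.l1_loss(mask_2, target_2)
```

No human labels are needed: mixing two clips manufactures both the input and the ground truth for free.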


The goal in video highlight detection is to retrieve a moment, in the form of a short video clip, that captures a user's primary attention or interest within an unedited video, as shown in the next Figure. An efficient highlight detection approach improves the video browsing experience, enhances social video sharing, and facilitates video recommendation. Supervised highlight detection approaches require a dataset of unedited videos with their corresponding manually annotated highlights, i.e., video-highlight pairs. Such datasets are very expensive to collect and create.

Video frames from three shorter user-generated video clips (top row) and one longer user-generated video (second row). Although all recordings capture the same event (surfing), video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about their content. The height of the red curve indicates the highlight score over time. We leverage this natural phenomenon as a free latent supervision signal in large-scale Web video.

This paper [1] avoids the expensive supervision entailed by collecting video-highlight pairs. The authors propose…
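The post is cut off, but the figure caption already names the free supervision signal: duration. A hedged sketch of a duration-based ranking objective (not necessarily the paper's exact loss; all names are mine):

```python
import torch.nn.functional as F

def duration_ranking_loss(score_net, clips_short, clips_long, margin=1.0):
    """Sketch: weak supervision from video duration. Segments sampled
    from short user-generated videos are treated as likely highlights
    and should outscore segments from long videos.

    score_net: model mapping a batch of clips to scalar highlight
    scores of shape (batch,).
    """
    s_short = score_net(clips_short)    # scores for short-video segments
    s_long = score_net(clips_long)      # scores for long-video segments

    # Margin ranking: s_short should exceed s_long by at least `margin`.
    return F.relu(margin - s_short + s_long).mean()
```

Since durations come for free with every Web video, this signal scales to large unlabeled collections without any manual highlight annotation.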


Standard classification architectures (e.g., ResNet and DenseNet) achieve great performance. However, they cannot answer the following question: what is the nearest neighbor image to a given query image? This question reveals an underlying limitation of the softmax loss. The softmax loss, used in training classification models, is prone to overfitting. It achieves superior classification performance, yet with an inferior class embedding. To address this limitation, recent literature [2,3] assumes a fixed number of modes per class, as shown in the next figure. This assumption requires expert user input and adds complexity for imbalanced datasets. …
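To make the nearest-neighbor question concrete, one common baseline (not the method of [1]) extracts penultimate-layer features from a pretrained classifier and retrieves by cosine similarity; it is exactly this embedding that the softmax loss leaves in a subpar state:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Use a pretrained classifier's penultimate layer as an embedding.
resnet = models.resnet18(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()   # drop the softmax classifier head
resnet.eval()

@torch.no_grad()
def nearest_neighbor(query, gallery):
    """query: (1, 3, H, W); gallery: (N, 3, H, W) preprocessed images.
    Returns the gallery index closest to the query in embedding space."""
    q = F.normalize(resnet(query), dim=1)        # (1, d)
    g = F.normalize(resnet(gallery), dim=1)      # (N, d)
    return (g @ q.t()).squeeze(1).argmax().item()  # cosine similarity
```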


Transfer learning (a.k.a. fine-tuning) is a core advantage of deep supervised learning. However, supervised learning requires labeled datasets, which are expensive to acquire. Unsupervised/self-supervised learning is a cheaper alternative to supervised approaches. To avoid costly annotation, an unsupervised learning approach leverages a pretext task as a supervision signal. For example, Gidaris et al. [2] rotate images and predict the rotation angle as a supervision signal; see the sketch below. Similarly, Pathak et al. [3] recover an image patch from the surrounding pixels. This paper [1] proposes a novel pretext task for unsupervised learning.
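Before getting to [1]'s task, here is what a pretext task looks like in practice, sketched with the rotation prediction of Gidaris et al. [2] (a minimal version; names are mine):

```python
import torch
import torch.nn.functional as F

def rotation_pretext_batch(images):
    """Sketch of the rotation pretext task [2]: rotate each image by
    0/90/180/270 degrees and use the rotation index as a free label.

    images: (N, C, H, W) with H == W.
    """
    rotated, labels = [], []
    for k in range(4):  # k * 90 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

def rotation_pretext_loss(model, images):
    # The model trains as a plain 4-way classifier; the labels come
    # from the transform itself, so no manual annotation is needed.
    x, y = rotation_pretext_batch(images)
    return F.cross_entropy(model(x), y)
```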

The paper proposes a new simple, yet efficient, pretext task, i.e., If…

Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcome.
