This paper [1] leverages two simple ideas to solve an important problem: batch normalization when the batch size b is small, e.g., b=2. Small batch sizes are typical for object-detection networks, where the input image size is 600–1024 pixels and the network has expensive operations such as the feature pyramid network (FPN).
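To see why a small batch hurts, here is a minimal sketch (my own toy setup, not code from the paper). It assumes activations drawn from a standard normal distribution and measures how much the per-batch mean fluctuates for b=2 versus b=32. The function name `batch_mean_spread` and the trial count are illustrative choices.

```python
import random
import statistics


def batch_mean_spread(batch_size, trials=2000, seed=0):
    """Standard deviation of the per-batch mean over many random batches.

    Assumption: activations are i.i.d. samples from N(0, 1).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    return statistics.pstdev(means)


spread_small = batch_mean_spread(batch_size=2)
spread_large = batch_mean_spread(batch_size=32)
print(f"b=2:  spread of the batch mean = {spread_small:.3f}")
print(f"b=32: spread of the batch mean = {spread_large:.3f}")
```

The spread shrinks roughly as 1/sqrt(b), so the mean estimated from two samples is about four times noisier than the one estimated from 32 samples.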

The paper [1] proposes batch normalization across mini-batch iterations. This is difficult because the network weights change between iterations, so the normalization statistics (mean and std) computed at earlier iterations are stale and cannot be accumulated naively. Formally, the paper solves the…


Every software engineer has used a debugger to debug their code. Yet, a neural network debugger… That’s news! This paper [1] proposes a debugger to debug and visualize attention in convolutional neural networks (CNNs).

Before describing the CNN debugger, I want to highlight a few attributes of program debuggers (e.g., gdb): (I) Debuggers are not program-specific, i.e., we use the same debugger to debug many programs; (II) We use a debugger to debug a few lines of our code, i.e., not every line; (III) Before using a debugger, we consciously decide which part of the program we want to debug (e.g…


Thanks, Kubra, for sharing your thoughts. I have never used Deep Image Prior. It is a cool idea, yet its training cost is a barrier. As you mentioned, the total time spent on training is optimized; that's correct. Yet, this training time is incurred for every image, i.e., every image is a training sample and there is no inference.

Furthermore, I had a colleague who used deep prior for RGB-D images [1]. The training cost becomes more severe as the number of dimensions increases.

So, yes, for 2D images the cost might be manageable, but I would proceed with caution for higher dimensions (e.g., videos). Thanks again for sharing your input :)

[1] Depth Completion Using a View-constrained Deep Prior


The metric learning literature assumes binary labels, where samples belong to either the same or different classes. While this binary perspective has motivated fundamental ranking losses (e.g., contrastive and triplet loss), it has reached a point of stagnation [2]. Thus, one novel direction for metric learning is continuous (non-binary) similarity. This paper [1] promotes metric learning beyond binary supervision, as shown in the next figure.

(a) Existing methods categorize neighbors into positive and negative classes, and learn a metric space where positive images are close to the anchor and negative ones far apart. In such a space, the distance between a pair of images is not necessarily related to their semantic similarity since the order and degrees of similarities between them are disregarded. (b) This paper [1] allows distance ratios in the label space to be preserved in the learned metric space to overcome the aforementioned limitation.
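To make "preserving distance ratios" concrete, here is a hedged sketch. The excerpt does not give the loss definition, so the formula below is only one plausible way to penalize a mismatch between the two ratios, not necessarily the paper's exact loss. The names `euclidean`, `log_ratio_penalty`, and `eps`, as well as the toy embeddings and labels, are my assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def log_ratio_penalty(f_a, f_i, f_j, y_a, y_i, y_j, eps=1e-12):
    """Squared gap between the log distance ratio in the embedding space
    (f_*) and the log distance ratio in the label space (y_*).

    Assumption: an illustrative form, not the paper's verified formula.
    """
    emb = math.log((euclidean(f_a, f_i) + eps) / (euclidean(f_a, f_j) + eps))
    lab = math.log((euclidean(y_a, y_i) + eps) / (euclidean(y_a, y_j) + eps))
    return (emb - lab) ** 2


# Continuous labels: neighbor i is 4x closer to the anchor than neighbor j.
y_a, y_i, y_j = [0.0], [1.0], [4.0]

# Embedding that keeps the 1:4 ratio, and one that squashes it.
preserved = log_ratio_penalty([0.0, 0.0], [2.0, 0.0], [8.0, 0.0], y_a, y_i, y_j)
violated = log_ratio_penalty([0.0, 0.0], [2.0, 0.0], [2.5, 0.0], y_a, y_i, y_j)
print(f"ratio preserved: penalty = {preserved:.6f}")
print(f"ratio violated:  penalty = {violated:.6f}")
```

A binary triplet loss would be satisfied by both embeddings as long as neighbor i is closer than neighbor j; a ratio-based penalty also cares about how much closer.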

Binary metric learning is insufficient for objects with continuous similarity criteria such as image captions, human poses, and scene graphs. Thus, this paper [1] proposes a triplet loss variant, dubbed log-ratio loss, that…


This paper [1] quantifies the financial and environmental costs (CO2 emissions) of training a deep network. It also draws attention to the inequality between academia and industry in terms of computational resources. The paper uses NLP architectures to make its case. Yet, the discussed issues are very relevant to the computer vision community.

The paper compares the CO2 emitted by familiar sources of consumption (e.g., a car's lifetime emissions) with the CO2 emitted by training a common NLP model (e.g., a transformer). Table 1 shows that training a transformer network emits significantly more CO2 than a fuel car.

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.

Then, the paper compares both the…


This paper [1] proposes an unsupervised framework for hard training-example mining. The proposed framework has two phases. Given a collection of unlabelled images, the first phase identifies positive and negative image pairs. Then, the second phase leverages these pairs to fine-tune a pretrained network.

Phase #1:

The first phase leverages a pretrained network to project the unlabelled images into an embedding space (manifold), as shown in Fig. 1.

Figure 1: A pretrained network embeds images into a manifold (feature space)

The manifold is used to create pairs/triplets from the unlabelled images. For an anchor image, the manifold provides two types of nearest neighbors: Euclidean (NN^e) and manifold (NN^m), as shown in Fig. 2. The…


This paper [1] proposes a tool, L2-CAF, to visualize attention in convolutional neural networks. L2-CAF is a generic visualization tool that can do everything CAM [3] and Grad-CAM [2] can do, but not vice versa.

Given a pre-trained CNN, an input x generates an output NT(x); this is the solid green path in the next figure. For the same input x, if the last convolutional layer’s output is multiplied by a constrained attention filter f, the network generates another output FT(x,f); this is the dashed orange path. The filter f is randomly initialized then optimized using…


Metric learning learns a feature embedding that quantifies the similarity between objects and enables retrieval. Metric learning losses can be categorized into two classes: pair-based and proxy-based. The next figure highlights the difference between the two classes. Pair-based losses pull similar samples together while pushing different samples apart (data-to-data relations). Proxy-based losses compute class representative(s) during training. Then, samples are pulled towards their class representatives and pushed away from different representatives (data-to-proxy relations).

Proxy-based losses compute class representatives (stars) for each class. Data samples (circles) are pulled towards and pushed away from these class representatives (data-to-proxy). Pair-based losses pull similar data samples together and push different data samples apart (data-to-data). The solid green lines indicate a pull force, while the dashed red lines indicate a push force.
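The two families can be contrasted with a small sketch. This is a toy illustration under my own assumptions, not any specific published loss: both functions use a contrastive-style pull/push term with a hypothetical `margin`, and the proxies are simply class means, whereas real proxy-based methods typically learn their proxies during training.

```python
import math
from itertools import combinations


def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pull_push(d, same_class, margin):
    """Pull same-class items together; push others at least `margin` apart."""
    return d ** 2 if same_class else max(0.0, margin - d) ** 2


def pair_based_loss(samples, labels, margin=1.0):
    """Data-to-data: every pair of samples contributes one term."""
    pairs = list(combinations(range(len(samples)), 2))
    total = sum(
        pull_push(dist(samples[i], samples[j]), labels[i] == labels[j], margin)
        for i, j in pairs
    )
    return total / len(pairs), len(pairs)


def class_means(samples, labels):
    """Assumption: use the class mean as each class's proxy."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def proxy_based_loss(samples, labels, proxies, margin=1.0):
    """Data-to-proxy: every (sample, proxy) combination contributes one term."""
    total, count = 0.0, 0
    for x, y in zip(samples, labels):
        for c, p in proxies.items():
            total += pull_push(dist(x, p), c == y, margin)
            count += 1
    return total / count, count


samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
           [2.0, 2.0], [2.1, 2.0], [2.0, 2.1], [2.1, 2.1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pair_loss, n_pairs = pair_based_loss(samples, labels)
proxy_loss, n_relations = proxy_based_loss(samples, labels, class_means(samples, labels))
print(f"pair-based:  {n_pairs} data-to-data relations, loss = {pair_loss:.4f}")
print(f"proxy-based: {n_relations} data-to-proxy relations, loss = {proxy_loss:.4f}")
```

With 8 samples and 2 classes, the pair-based version evaluates 28 relations while the proxy-based version evaluates 16. The pair count grows quadratically with the number of samples, while the proxy count grows linearly.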

The next table summarizes the pros and cons of both proxy-based and pair-based losses. For instance, pair-based losses leverage fine-grained semantic relations between samples but suffer slow…


This (G)old paper [1] tackles an interesting question: Why Does Unsupervised Pre-training Help Deep Learning? The authors support their conclusion with a ton of experiments. Yet, the findings contradict a common belief about unsupervised learning. That’s why I have mixed feelings about this paper. I will present the paper first, then follow up with my comments.

The authors strive to understand how unsupervised pretraining helps. There are two main hypotheses:

  1. Better optimization: Unsupervised pretraining puts the network in a region of parameter space where basins of attraction run deeper than when starting with random parameters. In simple words, the network…

This paper proposes PixelPlayer, a system to ground audio inside a video (frames) without manual supervision. Given an input video, PixelPlayer separates the accompanying audio into components and spatially localizes these components in the video. PixelPlayer enables us to listen to the sound originating from each pixel in the video as shown in the next Figure.

“Which pixels are making sounds?” Energy distribution of sound in pixel space. Overlaid heatmaps show the volumes from each pixel.

To train PixelPlayer using a neural network, a dataset is needed. The authors introduce a musical instrument video dataset for the proposed task, called the MUSIC (Multimodal Sources of Instrument Combinations) dataset. This dataset is crawled from YouTube, with no manual annotation. MUSIC dataset…

Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcomed.
