Representation Learning by Learning to Count
Transfer learning (a.k.a. fine-tuning) is a core advantage of deep supervised learning. However, supervised learning requires labeled datasets, which are expensive to acquire. Unsupervised/self-supervised learning is a cheaper alternative: to avoid costly annotation, it leverages a pretext task as the supervision signal. For example, Gidaris et al. [2] rotate images and predict the rotation angle as a supervision signal. Similarly, Pathak et al. [3] recover an image patch from the surrounding pixels. This paper [1] proposes a novel pretext task for unsupervised learning.
The paper proposes a simple yet effective pretext task: if we partition an image into non-overlapping regions, the numbers of visual primitives in the regions should sum up to the number of primitives in the original image. The next figure illustrates this: summing the counts of visual primitives over an image's tiles should match the count of visual primitives in the whole image. For example, the four tiles outlined in red contain {1, 0, 0, 1} noses; accordingly, the whole image should contain 2 noses. Concretely, the paper trains a neural network to count the number of visual primitives in an image.
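As a rough illustration of how the inputs for this task can be prepared, here is a minimal TensorFlow sketch of downsampling the full image and splitting it into four non-overlapping tiles (the function name and the 228x228 input size follow this article's notation; this is not the authors' code):

```python
import tensorflow as tf

def make_counting_inputs(image):
    """Given a 228x228x3 image, return the downsampled 114x114 image
    and its four non-overlapping 114x114 tiles."""
    image = tf.image.resize(image, (228, 228))   # ensure the expected size
    down = tf.image.resize(image, (114, 114))    # D(x): downsample by a factor of 2
    tiles = [image[r:r + 114, c:c + 114, :]      # T_i(x): the 2x2 grid of tiles
             for r in (0, 114) for c in (0, 114)]
    return down, tiles
```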
Given an image x of size 228x228, the loss function of the counting pretext can be formulated as follows (written here in the notation defined below):
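$$
\ell(x) \;=\; \left\| N(D(x)) - \sum_{i=1}^{4} N\big(T_i(x)\big) \right\|_2^2
$$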
where D(x) downsamples the image x by a factor of 2 into an image of size 114x114, T_i(x) denotes tile #i (of size 114x114) extracted from the original image x, and N(·) is the network's output representation.
This equation can be regarded as a contrastive loss, pulling the representation of the downsampled image D(x) toward the summed representations of the four tiles T_i(x). One caveat with this loss function is that it has a trivial solution, i.e., N(·) ≡ 0. To avoid this solution, another term is added that includes a "negative" image y, as follows:
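$$
\ell(x, y) \;=\; \left\| N(D(x)) - \sum_{i=1}^{4} N\big(T_i(x)\big) \right\|_2^2 \;+\; \left[\, M - \left\| N(D(y)) - \sum_{i=1}^{4} N\big(T_i(x)\big) \right\|_2^2 \,\right]_+
$$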
where [·]_+ denotes max(0, ·), M is a scalar margin, and y is a randomly sampled negative image. Basically, this triplet-loss-style term pushes the learned representation of image x away from that of image y. The next figure shows the AlexNet architecture applied to the two images (x, y).
In the figure, the symbol φ denotes the network's output representation, i.e., it is N(·) in the previous two equations.
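To make the objective concrete, here is a minimal TensorFlow sketch of the counting loss for a single (positive, negative) image pair. It reuses make_counting_inputs from the earlier sketch, and counting_loss, phi, and the margin value are my own placeholders rather than the authors' code:

```python
import tensorflow as tf

def counting_loss(phi, x, y, margin=10.0):
    """Counting loss for one positive image x and one negative image y.

    phi: the counting network, mapping a batch of 114x114x3 images to
         feature vectors interpreted as visual-primitive counts.
    margin: the scalar margin M (a hyperparameter).
    """
    down_x, tiles_x = make_counting_inputs(x)
    down_y, _ = make_counting_inputs(y)

    # The summed tile counts should match the count of the whole image x.
    sum_tiles = tf.add_n([phi(t[tf.newaxis]) for t in tiles_x])
    diff_pos = tf.reduce_sum(tf.square(phi(down_x[tf.newaxis]) - sum_tiles))

    # The counts of a different image y should NOT match the tile sum of x;
    # this hinge term rules out the trivial solution N(.) = 0.
    diff_neg = tf.reduce_sum(tf.square(phi(down_y[tf.newaxis]) - sum_tiles))
    return diff_pos + tf.maximum(0.0, margin - diff_neg)
```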
The counting network is trained on the ImageNet dataset. The quality of the learned representation is then quantitatively evaluated on three datasets: PASCAL, ImageNet, and Places. For PASCAL, the network pre-trained with the counting pretext is fine-tuned on PASCAL VOC 2007 and VOC 2012, an established benchmark for object classification, detection, and segmentation, as shown in the next table.
Since the counting network is already trained on the ImageNet dataset, the evaluation on ImageNet and Places instead trains a linear classifier on top of the frozen layers. The next two tables present the quantitative evaluation on both datasets.
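A linear probe of this kind can be sketched as follows in TensorFlow/Keras. The model path, dataset variables, and hyperparameters below are illustrative placeholders, not details from the paper:

```python
import tensorflow as tf

# Load the pretrained counting backbone (hypothetical path) and freeze it.
backbone = tf.keras.models.load_model("counting_backbone_conv_layers.h5")
backbone.trainable = False

# Linear classifier on top of the frozen convolutional features.
probe = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation="softmax"),  # e.g., 1000 ImageNet classes
])
probe.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# probe.fit(imagenet_train_ds, validation_data=imagenet_val_ds, epochs=...)
```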
The paper provides further quantitative and qualitative evaluations that are omitted from this article.
My Comments:
- There is a TensorFlow (TF) implementation on GitHub, but not by the authors. This TF implementation employs VGG-16 instead of AlexNet! Thus, further modifications are required to use this code and compare results with the related literature. I can only blame the paper’s authors, not Shao-Hua Sun, for that.
- EDIT [Dec 2020]: There are serious issues with Shao-Hua Sun’s implementation, so I developed my own TF implementation.
- The paper’s idea is simple and seems to work. I like simple ideas.
- Many unsupervised learning approaches employ the same architecture, AlexNet, for quantitative evaluation. This is required for a fair comparison, but I wonder whether these learning approaches remain valuable when applied to recent architectures like ResNet or DenseNet.
- That being said, most papers probably adopt AlexNet, an old architecture, to make training and evaluation computationally cheaper. AlexNet is computationally feasible for many scholars with limited GPU capability. Poor graduate students :)
- One technical point surprises me: the authors randomly pick the bicubic, bilinear, Lanczos, or area method to downsample the image, presumably so the network cannot latch onto the artifacts of any single interpolation method. It is reported that this randomization of downsampling methods significantly improves detection performance, by at least 2.2%! A minimal sketch of this randomization follows this list.
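Here is my own TensorFlow illustration of that randomization, using the interpolation methods built into tf.image.resize (not the authors' code):

```python
import random
import tensorflow as tf

DOWNSAMPLING_METHODS = ["bicubic", "bilinear", "lanczos3", "area"]

def random_downsample(image, size=(114, 114)):
    """Downsample with a randomly chosen interpolation method, so the
    network cannot rely on the artifacts of any single method."""
    method = random.choice(DOWNSAMPLING_METHODS)
    return tf.image.resize(image, size, method=method)
```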
Resources
[1] Representation Learning by Learning to Count
[2] Unsupervised Representation Learning by Predicting Image Rotations
[3] Context Encoders: Feature Learning by Inpainting