Representation Learning by Learning to Count

The number of visual primitives in the whole image should match the sum of the number of visual primitives in each tile (dashed red boxes).
The loss function of the counting pretext on an image x.
The final loss function add a triplet loss style term to avoid the trivial solution.
Training AlexNet to learn to count with two images: x (two dogs) and y (ship image). The proposed architecture uses a siamese arrangement so that we simultaneously produce features for 4 tiles and a downsampled image (two dogs). We also compute the feature from a randomly chosen downsampled image (ship image) to avoid the trivial solution.
Evaluation of transfer learning on PASCAL. Classification and detection are evaluated on PASCAL VOC2007. Segmentation is evaluated on PASCAL VOC 2012
ImageNet classification with a linear classifier.
Places classification with a linear classifier.
  1. There is a Tensorflow (TF) implementation on Github but not by the authors. This TF implementation employs VGG-16 instead of AlexNet! Thus, further modifications are required to use this code and compare results with related literature. I can only blame the paper’s authors, not Shao-Hua Sun, for that.
  2. EDIT [Dec 2020]: There are serious issues with Shao-Hua Sun’s implementation, so I developed my own TF implementation.
  3. The paper idea is simple and seems to work. I like simple ideas.
  4. Many unsupervised learning approaches employ the same architecture, AlexNet, for quantitative evaluation. This is required for fair comparison but I wonder if these learning approaches remain valuable when applied to recent architectures like ResNet or DenseNet.
  5. That being said, most papers probably adapt AlexNet, an old architecture, to make training and evaluation computationally cheaper. AlexNet is computationally feasible for a lot of scholars with limited GPU capability. Poor graduate students :)
  6. There is one technical point that surprises me; the authors randomly pick either the bicubic, bilinear, Lanczos, or the area method to downsample the image. It is reported that this randomization of different downsampling methods significantly improves the detection performance by at least 2.2%!



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Ahmed Taha

Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcomed.