Leveraging Unlabeled Data for Crowd Counting by Learning to Rank

Example of a crowd image
Training ranking and counting simultaneously. Within a single mini-batch, the counting loss is applied to images with count ground truth, and the ranking loss is applied to images without it.
Crowd density maps generated by the counting network pipeline
The average number of persons per pixel is estimated by average pooling over the crowd density map
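To make the average pooling step concrete, here is a minimal sketch in plain Python. It assumes a density map is a 2-D grid whose cells sum to the number of people, so the mean over all cells is the per-pixel density and multiplying back by the number of cells recovers the count. The function names and the toy map are hypothetical and do not come from the paper's code.

```python
def average_density(density_map):
    """Global average pooling: mean value over all cells of a 2-D density map."""
    cells = [value for row in density_map for value in row]
    return sum(cells) / len(cells)


def estimated_count(density_map):
    """Recover the total count as mean density times the number of cells."""
    n_cells = len(density_map) * len(density_map[0])
    return average_density(density_map) * n_cells


# Toy 2x4 density map (made-up values) whose cells sum to 2 people.
toy_map = [
    [0.0, 0.5, 0.0, 0.25],
    [0.25, 1.0, 0.0, 0.0],
]

print(average_density(toy_map))  # 0.25 persons per pixel
print(estimated_count(toy_map))  # 2.0 persons
```

Working with the pooled value rather than the full map means two crops can be compared through a single scalar each, which is what a ranking term needs.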
Loss function combines both counting and ranking terms
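The combination of the two terms can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the counting term is taken to be a mean squared error between predicted and ground-truth density maps, the ranking term a pairwise hinge that penalizes a sub-crop whose predicted count exceeds that of the crop containing it, and the weight `lam` and the `margin` are hypothetical parameters.

```python
def counting_loss(pred_map, gt_map):
    """Mean squared error between predicted and ground-truth density maps."""
    diffs = [
        (p - g) ** 2
        for pred_row, gt_row in zip(pred_map, gt_map)
        for p, g in zip(pred_row, gt_row)
    ]
    return sum(diffs) / len(diffs)


def ranking_loss(count_sub, count_super, margin=0.0):
    """Hinge penalty when a sub-crop is predicted to hold more people
    than the larger crop that contains it."""
    return max(0.0, count_sub - count_super + margin)


def combined_loss(labeled, ranked_pairs, lam=1.0):
    """labeled: list of (pred_map, gt_map) for images with ground truth.
    ranked_pairs: list of (count_sub, count_super) for unlabeled images.
    lam: assumed weight balancing the two terms."""
    l_count = sum(counting_loss(p, g) for p, g in labeled) / max(len(labeled), 1)
    l_rank = sum(ranking_loss(s, b) for s, b in ranked_pairs) / max(len(ranked_pairs), 1)
    return l_count + lam * l_rank


# One labeled image and two unlabeled crop pairs in the same mini-batch (toy values).
labeled = [([[1.0, 0.0]], [[0.0, 0.0]])]
ranked_pairs = [(3.0, 5.0), (6.0, 4.0)]  # second pair violates the ordering
print(combined_loss(labeled, ranked_pairs, lam=0.5))  # 1.0
```

Only the second pair contributes to the ranking term, since the first already respects the ordering. The labeled image contributes only to the counting term, which mirrors the mini-batch split described in the caption above.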
Table 2 evaluates different training paradigms
Quantitative evaluation on two labeled crowd counting datasets: UCF_CC_50 and ShanghaiTech
Qualitative results in terms of crowd density maps
  1. The paper is well organized and easy to understand. One minor style issue: the text size in Table 4 is inconsistent with the rest of the tables.
  2. The authors argue that image ranking is a “poorly-defined nature self-supervised task”. The network could decide to count anything, e.g. ‘hats’, ‘trees’, or ‘people’, all of which would agree with the ranking constraints. I agree; I think tiny face detection would be a better complementary task. The recent interest in tiny face detection has produced labeled datasets, e.g. the WIDER FACE dataset, that can be leveraged.
  3. The authors state that detection approaches to counting would fail in extremely dense crowd scenes because of occlusion and the low resolution of individual persons. I wish quantitative evidence had been provided to support this claim.



Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcome.