Yet another self-supervised learning approach :)
In this paper, an auxiliary task is proposed to boost counting performance on crowd images. Counting people in the following image is a cumbersome task; thus, labeled datasets are scarce and expensive to prepare. Counting people is important in video surveillance, safety monitoring, and behavior analysis. Counting objects also has important applications in medical and biological image processing and in vehicle counting.
In this paper, the two main contributions are (1) proposing a self-supervised auxiliary task to boost counting performance and (2) promoting a new paradigm for leveraging this auxiliary task during training. Other marginal contributions, like preparing new datasets of crowd images, are omitted in this article. The main idea is to use image ranking as a pretext task to boost counting performance. The following figure contains three nested sub-images (I_1, I_2, I_3). Without counting people, it is guaranteed that I_1 contains at least as many people as I_2, and I_2 at least as many as I_3, because each sub-image lies inside the previous one. Thus, training a neural network to rank images by people count provides a free training signal that boosts the performance of traditional supervised approaches.
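The ranking constraint comes for free from the geometry of the crops. A minimal sketch of generating such nested sub-images, assuming crops centered on the image and shrinking by a fixed factor (the function names, `k`, and `scale` are illustrative choices, not taken from the paper):

```python
def nested_crops(width, height, k=3, scale=0.75):
    """Return k boxes (left, top, right, bottom), largest first.

    Each box is centered on the image and contained in the previous one,
    so box i can never hold fewer people than box i + 1.
    Centering and the fixed scale factor are assumptions of this sketch.
    """
    cx, cy = width / 2.0, height / 2.0
    w, h = float(width), float(height)
    boxes = []
    for _ in range(k):
        left, top = int(round(cx - w / 2)), int(round(cy - h / 2))
        right, bottom = int(round(cx + w / 2)), int(round(cy + h / 2))
        boxes.append((left, top, right, bottom))
        w, h = w * scale, h * scale  # shrink the next crop
    return boxes


def contains(outer, inner):
    """True if box `inner` lies entirely inside box `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


boxes = nested_crops(640, 480)  # I_1, I_2, I_3 for a 640x480 image
```

Every ordered pair of boxes from this list yields one ranking constraint without any manual annotation.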
The second main contribution is the multi-task training. The standard approach to exploiting self-supervised learning is to train on the self-supervised task first, then fine-tune the resulting network on the supervised task, which has limited data. It is shown that this approach, which is used by the vast majority of self-supervised methods, produces inferior results for crowd counting. Instead, the proposed self-supervision is added as a proxy task to supervised crowd counting in a multi-task network. To support this claim, three different training paradigms are evaluated: ranking then fine-tuning (counting), alternating ranking and counting, and multi-task ranking and counting simultaneously.
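The difference between the three paradigms is only in which loss drives each update. A toy sketch of the schedules, assuming a half-and-half split for fine-tuning and per-step switching for alternation (both are simplifications made here for illustration; `task_schedule` is a hypothetical helper, not the authors' code):

```python
def task_schedule(paradigm, n_steps):
    """Return, for each training step, the tuple of tasks whose loss is used."""
    steps = []
    for t in range(n_steps):
        if paradigm == "finetune":
            # Ranking first, then counting only (split point is an assumption).
            steps.append(("ranking",) if t < n_steps // 2 else ("counting",))
        elif paradigm == "alternating":
            # Switch task every step (switching interval is an assumption).
            steps.append(("ranking",) if t % 2 == 0 else ("counting",))
        elif paradigm == "multitask":
            # Both losses contribute to every update.
            steps.append(("ranking", "counting"))
        else:
            raise ValueError(f"unknown paradigm: {paradigm}")
    return steps


schedules = {p: task_schedule(p, 4) for p in ("finetune", "alternating", "multitask")}
```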
Training on the two tasks, ranking and counting, simultaneously achieves the best result. Thus, the final training pipeline is the following:
The counting loss uses mean squared error (MSE). For each image, the counting branch of the network produces a 14x14 output. These outputs are crowd density maps, which indicate the number of persons per pixel, as in the following image.
The counting loss term is

$$L_c = \frac{1}{M} \sum_{i=1}^{M} \left( y_i - \hat{y}_i \right)^2$$

where M is the number of images in a training batch, y_i is the ground-truth person density map of the i-th image in the batch, and ŷ_i is the network's prediction for that image.
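As a concrete reading of the counting loss, here is a minimal pure-Python sketch. It assumes density maps are nested lists and that the squared error is summed over all cells of a map before averaging over the batch; that reduction choice is an assumption of this sketch, not something the article specifies.

```python
def counting_loss(targets, predictions):
    """Batch mean of the summed squared error between density maps.

    targets, predictions: lists of M density maps (each a list of rows).
    Summing over cells (rather than averaging) is an assumption here.
    """
    total = 0.0
    for y, y_hat in zip(targets, predictions):
        total += sum((a - b) ** 2
                     for row_y, row_p in zip(y, y_hat)
                     for a, b in zip(row_y, row_p))
    return total / len(targets)


zeros = [[0.0] * 14 for _ in range(14)]  # empty 14x14 ground-truth map
halves = [[0.5] * 14 for _ in range(14)]  # prediction of 0.5 in every cell
loss = counting_loss([zeros, zeros], [zeros, halves])
```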
For ranking images, average pooling is applied to the crowd density map to estimate the number of persons per spatial unit, ĉ(I_i), according to

$$\hat{c}(I_i) = \frac{1}{M} \sum_{j} \hat{y}_i(x_j)$$

where x_j are the spatial coordinates of the density map, and M = 14 × 14 is the number of spatial units in the density map (note that M is reused here; in the counting loss it denotes the batch size). The final loss function is a combination of the counting and ranking losses, weighted by a hyper-parameter lambda.
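The ranking branch can be sketched in the same style. The article does not spell out the ranking loss, so the pairwise hinge form with a `margin` parameter below is an assumption, as is the weighted sum `l_count + lam * l_rank` used to combine the two terms; all function names are illustrative.

```python
def estimated_count(density_map):
    """Average-pool a density map into a single per-spatial-unit estimate."""
    cells = [v for row in density_map for v in row]
    return sum(cells) / len(cells)


def ranking_loss(c_larger, c_smaller, margin=0.0):
    """Pairwise hinge loss (assumed form): zero when the crop known to
    contain at least as many people also gets the higher estimate."""
    return max(0.0, c_smaller - c_larger + margin)


def total_loss(l_count, l_rank, lam=1.0):
    """Weighted sum of counting and ranking losses (assumed combination)."""
    return l_count + lam * l_rank


outer = [[1.0] * 14 for _ in range(14)]  # predicted map for the larger crop
inner = [[0.5] * 14 for _ in range(14)]  # predicted map for the nested crop
ok = ranking_loss(estimated_count(outer), estimated_count(inner))
violated = ranking_loss(estimated_count(inner), estimated_count(outer))
```

With this form, a correctly ordered pair contributes nothing to the loss, so the unlabeled crops only push the network when its estimates contradict the known ordering.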
Table 2 compares the different training paradigms. Multi-task training outperforms both fine-tuning and alternating task learning. Another important conclusion, from the second row, is the effectiveness of ImageNet pre-trained weights for training a crowd counting network. This is surprising because previous approaches did not exploit such pre-training.
Tables 3 and 4 present a quantitative evaluation on two different labeled datasets: the UCF_CC_50 dataset and the ShanghaiTech dataset. In these experiments, ranking images are drawn from two different datasets collected by the authors through Google Images. One ranking dataset is constructed by “keyword query” while the other is constructed by “Query-by-example image retrieval”.
Figure 5 shows qualitative results in terms of crowd density maps.
- The paper is well organized and easy to understand. One minor style issue: the text size in Table 4 is inconsistent with the rest of the tables.
- The author argues that image ranking is a “poorly-defined nature self-supervised task”. The network could decide to count anything, e.g. ‘hats’, ‘trees’, or ‘people’, all of which would agree with the ranking constraints. I agree; I think tiny face detection would be a better complementary task. The recent interest in tiny face detection has produced labeled datasets, e.g. the WIDER FACE dataset, that can be leveraged.
- The author states that using detection approaches for counting would fail in extremely dense crowd scenes due to occlusion and the low resolution of persons. I wish quantitative evidence had been provided to support this claim.