Learning Deep Features for Discriminative Localization

This paper leverages global average pooling (GAP) for weakly-supervised object localization and for visualizing internal CNN representations. The key idea is to apply GAP to the last convolutional layer, immediately before a single final fully connected layer. This makes it possible to identify discriminative image regions in a single forward pass for a wide variety of tasks. With this architecture, class activation maps (CAMs) can be generated directly from the trained network. The proposed class activation mapping technique identifies the importance of image regions by projecting the weights of the output layer back onto the convolutional feature maps, as shown in the figure below.

Global average pooling (GAP) pools each feature map (activation map) of the last convolutional layer into a single value, producing a 1D vector X. The last fully connected layer, from X to the logits, learns the importance of each activation map, i.e. convolutional feature, for every class.
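The two steps can be written compactly. The symbols below are chosen for this note and may differ from the paper's exact notation: f_k(x, y) is the activation of feature map k at spatial position (x, y), H and W are the height and width of the last convolutional layer, K is the number of feature maps, and w_k^c is the fully connected weight linking feature map k to class c. The bias term is left out for simplicity.

```latex
F_k = \frac{1}{HW} \sum_{x=1}^{W} \sum_{y=1}^{H} f_k(x, y),
\qquad
S_c = \sum_{k=1}^{K} w_k^{c} \, F_k
```

Here F_k is the k-th entry of the vector X, and S_c is the logit for class c.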

A class activation map (CAM) is generated through a weighted sum of the last convolutional features (activation maps) using the fully connected layer weights per class as shown in the figure below. CAM for a particular class indicates the discriminative image regions used by the CNN to identify that class.
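The weighted sum can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the array shapes, and the random inputs are all assumptions, and the fully connected bias is omitted.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Weighted sum of feature maps for one class.

    features:   (K, H, W) activations of the last conv layer (assumed layout)
    fc_weights: (C, K) weights of the single fully connected layer
    """
    # Contract the channel axis: sum_k w[class_idx, k] * features[k]
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

# Hypothetical sizes: 8 feature maps of 13x13, 5 classes.
rng = np.random.default_rng(0)
K, H, W, C = 8, 13, 13, 5
features = rng.random((K, H, W))
fc_weights = rng.standard_normal((C, K))

pooled = features.mean(axis=(1, 2))   # GAP: (K, H, W) -> (K,)
logits = fc_weights @ pooled          # class scores, bias omitted
predicted = int(logits.argmax())
cam = class_activation_map(features, fc_weights, predicted)  # (H, W)
```

Because GAP and the fully connected layer are both linear, the spatial mean of a class's CAM equals that class's logit (without bias), which is why the map can be read as a per-location contribution to the score.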

Class Activation Mapping: the predicted class score is mapped back to the previous convolutional layer to generate the class activation maps (CAMs). The CAM highlights the class-specific discriminative regions.

The CAM generation idea is simple and applicable to most recent CNN architectures because they employ a single fully connected layer. To generate CAMs from a VGG, however, which has multiple fully connected layers, these extra layers must be removed and the network fine-tuned again. Similar architectural tweaks are proposed in the paper to enable GAP/CAM. For example, the localization ability improves when the last convolutional layer, before GAP, has a higher spatial resolution. Thus, extra convolutional layers are removed as well. For AlexNet, the layers after conv5 (i.e., pool5 to prob) are removed, resulting in a mapping resolution of 13×13. Similar changes are applied to VGG and GoogLeNet.

Such architectural tweaks lead to a small degradation in classification performance, as shown in the next table. This drop is justified by the weakly-supervised localization capability added to the network.

Localization is evaluated against comparable weakly-supervised and fully-supervised approaches in Tables 2 and 3, respectively.
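To be evaluated, a CAM first has to be turned into a bounding box. The paper does this by keeping the region above 20% of the CAM's maximum value and taking the box around the largest connected component. The sketch below is one hedged way to write that step: the function name, the 4-connectivity, and the (row, column) output format are assumptions of this note, and the box is in CAM coordinates, so it would still need scaling to the input image size.

```python
from collections import deque

import numpy as np

def cam_to_bbox(cam, rel_threshold=0.2):
    """Box (row_min, col_min, row_max, col_max) around the largest connected
    region where cam > rel_threshold * cam.max(); None if no pixel qualifies."""
    mask = cam > rel_threshold * cam.max()
    seen = np.zeros(mask.shape, dtype=bool)
    best = []
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        # Breadth-first search over 4-connected neighbours (an assumption).
        seen[start] = True
        queue, component = deque([start]), []
        while queue:
            r, c = queue.popleft()
            component.append((r, c))
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                inside = 0 <= nr < mask.shape[0] and 0 <= nc < mask.shape[1]
                if inside and mask[nr, nc] and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        if len(component) > len(best):
            best = component
    if not best:
        return None
    rows, cols = zip(*best)
    return int(min(rows)), int(min(cols)), int(max(rows)), int(max(cols))

# Toy map: a 2x3 blob plus one isolated high-valued pixel.
toy = np.zeros((6, 6))
toy[1:3, 1:4] = 1.0
toy[5, 5] = 0.9
box = cam_to_bbox(toy)
```

In the toy example the isolated pixel passes the threshold but belongs to a smaller component, so it does not affect the box.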

The next figures show qualitative results and how CAM changes per class for a single input.

In the paper, further experiments highlight generic localization capabilities on unseen datasets like Stanford Action40, Caltech256, SUN397, and UIUCEvent8.

My comments:

[+1] The simplicity of GAP/CAM led to its popularity despite the requirement to tweak the network architectures. The approach is valid for both object and action recognition tasks as long as a compatible architecture is employed.

[+1] The thing I like most about GAP is the ability to generate a per-class activation map. To my limited knowledge, most attention and visualization papers generate a single attention map per input.

[+1] The paper is well-written and easy to understand. The proposed method is evaluated on multiple datasets, architectures, and tasks. Only a few of these tasks are covered in this article.

[-1] My main criticism is the localization procedure employed for the quantitative evaluation. It feels ad hoc, with multiple heuristics. For example, (1) two bounding boxes (one tight and one loose) are taken from the class activation maps of the top 1st and 2nd predicted classes, and (2) one loose bounding box is taken from the top 3rd predicted class. (3) Convolutional layers are removed to increase the resolution of the activation maps and thus the localization accuracy, yet the paper doesn’t study “what resolution is best?”

I write reviews of computer vision papers. Writing tips are welcome.