Boosting Standard Classification Architectures Through a Ranking Regularizer
Standard classification architectures (e.g., ResNet and DenseNet) achieve great classification performance. However, they cannot answer the following question: what is the nearest neighbor image to a given query image? This question reveals an underlying limitation of the softmax loss. The softmax loss, used to train classification models, is prone to overfitting: it achieves superior classification performance, yet an inferior class embedding. To address this limitation, recent literature [2,3] assumes a fixed number of modes per class, as shown in the next figure. This assumption requires expert user input and raises complexity for imbalanced datasets. In contrast, this paper [1] tackles the limitation with a simpler approach.
The main idea is to leverage the metric learning literature to regularize the embedding space while keeping the softmax loss. Concretely, a single fully connected layer is added on top of the pre-logit layer to create an embedding head, as shown in the next figure. In this paper, a triplet loss is applied to the embedding head while the softmax loss is applied to the original classification head.
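For concreteness, here is a minimal PyTorch sketch of such a two-head model. The ResNet-50 backbone, the 128-dimensional embedding, and the layer names are illustrative assumptions, not the authors' exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class TwoHeadNet(nn.Module):
    """Backbone + classification head (softmax loss) + embedding head (triplet loss)."""
    def __init__(self, num_classes, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(pretrained=True)       # assumed backbone choice
        feat_dim = backbone.fc.in_features                # pre-logit feature size
        backbone.fc = nn.Identity()                       # strip the original classifier
        self.backbone = backbone
        self.cls_head = nn.Linear(feat_dim, num_classes)  # original classification head
        self.emb_head = nn.Linear(feat_dim, embed_dim)    # the single added FC layer

    def forward(self, x):
        feat = self.backbone(x)                           # pre-logit features
        logits = self.cls_head(feat)                      # fed to softmax cross-entropy
        embedding = F.normalize(self.emb_head(feat), dim=1)  # fed to the triplet loss
        return logits, embedding
```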
This simple extension (a single fully connected layer) brings nearest-neighbor retrieval capability to classification architectures. It also boosts classification performance by pulling similar points together while pushing different classes apart, i.e., margin maximization. The proposed idea supports recent classification architectures such as MobileNet and InceptionNet. Furthermore, it comes at a very cheap computational cost: roughly a 2% increase in training time.
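The joint objective can be sketched as a weighted sum of the softmax cross-entropy and a mined triplet loss. The batch-hard mining and the weighting factor `lam` below are simplifying assumptions, not the paper's exact formulation (which uses hard and semi-hard mining).

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-hard triplet loss: hardest positive and hardest negative per anchor.
    A simplified stand-in for the hard / semi-hard mining used in the paper."""
    dist = torch.cdist(embeddings, embeddings)                  # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)           # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    hardest_pos = (dist * (same & ~eye)).max(dim=1).values      # farthest positive
    hardest_neg = dist.masked_fill(same, float('inf')).min(dim=1).values  # closest negative
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def two_head_loss(logits, embeddings, labels, lam=1.0):
    """Joint objective: softmax cross-entropy plus the triplet ranking regularizer."""
    return F.cross_entropy(logits, labels) + lam * batch_hard_triplet_loss(embeddings, labels)
```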
The proposed two-head architecture is evaluated on five fine-grained recognition datasets: Flower102, Stanford Cars, Aircraft, Dogs, and NABirds, as shown in the next figure. The two-head architecture brings a steady classification improvement even on a state-of-the-art architecture such as DenseNet.
Then, the two-head architecture is evaluated on an imbalanced video dataset: the Honda Research Institute driving dataset. The next figure presents the per-class classification performance to emphasize the improvement margin on the minority classes, located towards the bottom of the table.
Then, the paper evaluates nearest neighbor retrieval using the recall@K (Top-K) metric. The next table shows that recall@1 is comparable with the classification performance, while recall@4 is always superior. This is an implicit indication of a superior embedding space.
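As a reference point, recall@K over a labeled gallery can be computed as follows. This is a generic sketch of the metric, not the authors' evaluation code.

```python
import torch

def recall_at_k(embeddings, labels, k=4):
    """Recall@K: fraction of queries whose K nearest neighbors (excluding the
    query itself) contain at least one image of the same class."""
    dist = torch.cdist(embeddings, embeddings)         # pairwise distances
    dist.fill_diagonal_(float('inf'))                  # exclude self-matches
    knn = dist.topk(k, largest=False).indices          # indices of the K nearest neighbors
    hits = (labels[knn] == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```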
The next figure shows a qualitative retrieval evaluation.
Finally, the supplementary material presents a detailed embedding space evaluation using multiple metrics (NMI and Recall@K). The evaluation compares performance with and without the two-head architecture (softmax only). As presented in the next table, the triplet loss regularizer always improves the embedding space quality by a significant margin.
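For completeness, NMI on an embedding space is typically computed by clustering the embeddings into as many clusters as there are classes and comparing the cluster assignments with the ground-truth labels. The scikit-learn sketch below makes this concrete; it is an assumed recipe, not the paper's script.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def embedding_nmi(embeddings, labels):
    """NMI: cluster the embeddings into as many clusters as there are classes,
    then compare the cluster assignments with the ground-truth labels."""
    n_classes = len(np.unique(labels))
    cluster_ids = KMeans(n_clusters=n_classes, n_init=10).fit_predict(embeddings)
    return normalized_mutual_info_score(labels, cluster_ids)
```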
My comments
- The paper provides many quantitative evaluations using different architectures, datasets and classification tasks.
- The paper's code is available on GitHub. The GPU implementation is quite fast and increases training time only marginally.
- The proposed idea is simple and surprisingly powerful. The paper provides extensive quantitative evaluations of the feature embedding (embedding space).
- The paper's main limitation is the lack of experiments on very large datasets. The biggest dataset used is NABirds, which has 48,562 images. This is relatively small by current (year 2020) standards.
- The paper claims the two-head architecture boosts the performance of minority classes. However, this is evaluated on a single imbalanced dataset. Evaluation across various imbalanced datasets is needed to support this claim.
- There is a huge metric learning (ranking losses) literature, in terms of better losses and sampling strategies, that is omitted in the paper. The paper only uses variants of the triplet loss: hard and semi-hard mining.
References
[1] Boosting Standard Classification Architectures Through a Ranking Regularizer
[2] A discriminative feature learning approach for deep face recognition.
[3] Triplet-center loss for multi-view 3d object retrieval.