Boosting Standard Classification Architectures Through a Ranking Regularizer

Standard classification architectures (e.g., ResNet and DenseNet) achieve great classification performance. However, they cannot answer the following question: what is the nearest neighbor image to a given query image? This question reveals an underlying limitation of the softmax loss. The softmax loss, used to train classification models, is prone to overfitting: it achieves superior classification performance, yet an inferior class embedding. To address this limitation, recent literature [2, 3] assumes a fixed number of modes per class, as shown in the next figure. This assumption requires expert user input and raises complexity for imbalanced datasets. In contrast, the paper [1] tackles the limitation with a simpler approach.

Visualization of softmax and feature embedding regularizers. Softmax separates samples with neither class compactness nor margin maximization considerations. Center loss promotes unimodal, compact classes, while magnet loss supports multi-modal embeddings. Triplet center loss strives for unimodal embeddings, margin maximization, and class compactness. The computed class centers are depicted with a star symbol.

The main idea is to leverage the metric learning literature to regularize the embedding space while keeping the softmax loss. Concretely, this is achieved by adding a single fully connected layer on top of the pre-logit layer to generate an embedding head as shown in the next figure. In this paper, a triplet loss is applied to the embedding head while softmax is applied to the original classification head.

The proposed two-head architecture. The last convolutional feature map (h) supports both embedding and classification heads. Operations and dimensions are highlighted with blue and pink colors, respectively. ResNet-50 dimensions are used for illustration.
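The combined objective can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: `two_head_loss`, `W_cls`, `W_emb`, and `lam` are illustrative names, and a naive batch-all triplet loss stands in for the semi-hard/hard mining variants the paper actually uses.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def triplet_loss(emb, labels, margin=0.2):
    # Batch-all triplet loss: mean hinge over every valid (anchor, pos, neg)
    # triplet. The paper instead mines semi-hard or hard triplets.
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)  # pairwise distances
    losses = []
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue
                losses.append(max(d[a, p] - d[a, neg] + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0

def two_head_loss(h, W_cls, W_emb, labels, lam=1.0):
    # h: pre-logit features feeding both heads, as in the figure above.
    logits = h @ W_cls                                   # classification head
    emb = h @ W_emb                                      # embedding head (one FC layer)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize embeddings
    return softmax_cross_entropy(logits, labels) + lam * triplet_loss(emb, labels)
```

In a real network, `h`, `W_cls`, and `W_emb` would be produced by a deep backbone and trained jointly; the sketch only shows how the two losses combine.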

This simple extension (a single fully connected layer) brings the nearest neighbor capability to classification architectures. The extension also boosts classification performance by pulling similar points together while pushing different classes apart, i.e., margin maximization. The proposed idea supports the most recent classification architectures — MobileNet & InceptionNet. Furthermore, it comes at a very cheap computational cost: a 2% increase in training time.

The proposed two-head architecture is evaluated on fine-grained recognition datasets: Flower102, Stanford Cars, Aircrafts, Dogs, and NABirds, as shown in the next figure. The two-head architecture brings a steady classification improvement even on the state-of-the-art architecture — DenseNet.

Quantitative evaluation using the five FGVR datasets on ResNet-50, Inception-V4, and DenseNet-161.

Then, the two-head architecture is evaluated on an imbalanced video dataset: Honda Research Institute driving dataset. The next figure presents the classification performance on each class individually to emphasize the improvement margin on minority classes — towards the bottom of the table.

Quantitative evaluation using the Honda driving dataset. Each row denotes the accuracy of a particular class (higher is better). Our two-head architecture using semi-hard triplet loss achieves better performance on minority classes.

Next, the paper evaluates nearest neighbor retrieval using the recall@k (Top-K) metric. The next table shows that recall@1 is comparable to the classification accuracy, while recall@4 is always superior. This is an implicit indication of a superior embedding space.

Comparative quantitative evaluation between retrieval and classification as an upper bound. Both retrieval and classification accuracies are comparable. Retrieval top 4 is superior to classification top 1.
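Recall@k can be computed directly from the embedding space. A minimal NumPy sketch, assuming the common leave-one-out protocol (the function name and protocol details are illustrative, not taken from the paper):

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    # A query counts as a hit if ANY of its k nearest neighbors
    # (excluding the query itself) shares the query's class label.
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-matches
    hits = 0
    for i in range(len(labels)):
        nearest = np.argsort(d[i])[:k]
        hits += int((labels[nearest] == labels[i]).any())
    return hits / len(labels)
```

Because a hit at rank 1 is also a hit at rank 4, recall@4 upper-bounds recall@1, which matches the table's pattern.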

The next figure shows a qualitative retrieval evaluation.

Retrieval qualitative evaluation on three FGVR datasets: Flowers-102, Aircrafts, and Cars. Given a query image, the three nearest neighbors are depicted. The three consecutive rows show search results using center loss, semi-hard, and hard triplet regularizers. Green and red outlines denote match and mismatch between the query and its result respectively.

Finally, the supplementary material presents a detailed embedding space evaluation using multiple metrics (NMI and recall@K). The evaluation compares performance with and without (softmax only) the two-head architecture. As presented in the next table, the triplet loss regularizer always improves the embedding space quality by a significant margin.

Detailed feature embedding quantitative analysis across the five datasets using the ResNet-50 architecture’s penultimate layer. Triplet with hard mining achieves a superior embedding with ResNet-50 trained for 40K iterations. Center loss suffers from high instability. The vanilla method denotes softmax loss only.
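NMI (normalized mutual information) scores the agreement between cluster assignments (e.g., k-means clusters over the embeddings) and the ground-truth classes, and is invariant to label permutations. A minimal NumPy sketch, assuming the common 2·I(A;B)/(H(A)+H(B)) normalization:

```python
import numpy as np

def nmi(a, b):
    # Normalized mutual information between two label assignments.
    a, b = np.asarray(a), np.asarray(b)

    def entropy(x):
        p = np.unique(x, return_counts=True)[1] / len(x)
        return -(p * np.log(p)).sum()

    # Mutual information from the empirical joint and marginal distributions.
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_joint = np.mean((a == va) & (b == vb))
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (np.mean(a == va) * np.mean(b == vb)))

    denom = entropy(a) + entropy(b)
    return 2.0 * mi / denom if denom > 0 else 1.0
```

A perfect clustering scores 1.0 even if the cluster IDs are permuted relative to the class labels; independent assignments score 0.0.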

My comments

  • The paper provides many quantitative evaluations using different architectures, datasets and classification tasks.
  • The paper’s code is available on GitHub. The GPU implementation is quite fast and increases training time only marginally.
  • The proposed idea is simple and surprisingly powerful. The paper provides intensive feature embedding (embedding space) quantitative evaluations.
  • The paper’s main limitation is the lack of experiments on very large datasets. The biggest dataset used is NABirds, which has 48,562 images. This is relatively small by current (2020) standards.
  • The paper claims the two-head architecture boosts minority classes’ performance. However, it is evaluated on a single imbalanced dataset. Evaluation through various imbalanced datasets is needed to support this claim.
  • There is a huge metric learning (ranking loss) literature on better losses and sampling strategies that is omitted in the paper. The paper only uses variants of the triplet loss — hard and semi-hard!

References

[1] Boosting Standard Classification Architectures Through a Ranking Regularizer.

[2] A Discriminative Feature Learning Approach for Deep Face Recognition.

[3] Triplet-Center Loss for Multi-View 3D Object Retrieval.

I write reviews on computer vision papers. Writing tips are welcome.
