Boosting Standard Classification Architectures Through a Ranking Regularizer

Standard classification architectures (e.g., ResNet and DenseNet) achieve great performance. However, they cannot answer the following question: what is the nearest neighbor image to a given query image? This question reveals an underlying limitation of the softmax loss. The softmax loss, used to train classification models, is prone to overfitting: it achieves superior classification performance, yet an inferior class embedding. To address this limitation, recent literature [2, 3] assumes a fixed number of modes per class, as shown in the next figure. This assumption requires expert user input and raises complexity for imbalanced datasets. In contrast, this paper [1] tackles the limitation with a simpler approach.

Visualization of softmax and feature embedding regularizers. Softmax separates samples with neither class compactness nor margin maximization considerations. Center loss promotes a unimodal, compact class, while magnet loss supports a multi-modal embedding. Triplet center loss strives for a unimodal embedding with margin maximization and class compactness. The computed class centers are depicted with a star symbol.
The proposed two-head architecture. The last convolutional feature map (h) supports both the embedding and classification heads. Operations and dimensions are highlighted in blue and pink, respectively. ResNet-50 dimensions are used for illustration.
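For concreteness, here is a minimal PyTorch sketch of such a two-head model and its joint objective. The backbone, layer names, embedding size, and the regularizer weight are assumptions for illustration, not the authors' exact implementation; the released code is the reference.

```python
# Minimal sketch (not the authors' code): a ResNet-50 backbone whose last
# convolutional feature map h feeds both a classification head and an
# embedding head trained with a triplet-loss (ranking) regularizer.
import torch
import torch.nn as nn
import torchvision.models as models

class TwoHeadNet(nn.Module):
    def __init__(self, num_classes, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # everything up to (and including) the last conv block -> feature map h
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2048, num_classes)   # classification head
        self.embedder = nn.Linear(2048, embed_dim)       # embedding head

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)       # (B, 2048)
        logits = self.classifier(h)
        emb = nn.functional.normalize(self.embedder(h), dim=1)
        return logits, emb

# Joint objective: softmax (cross-entropy) loss plus a ranking regularizer.
model = TwoHeadNet(num_classes=102)
ce = nn.CrossEntropyLoss()
triplet = nn.TripletMarginLoss(margin=0.2)
lam = 1.0  # regularizer weight (assumed value)

def loss_fn(logits, emb, labels, anchor_idx, pos_idx, neg_idx):
    # anchor/positive/negative indices come from an in-batch mining step
    return ce(logits, labels) + lam * triplet(emb[anchor_idx], emb[pos_idx], emb[neg_idx])
```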
Quantitative evaluation on the five FGVR datasets using ResNet-50, Inception-V4, and DenseNet-161.
Quantitative evaluation using the Honda driving dataset. Each row denotes the accuracy of a particular class (higher is better). Our two-head architecture using semi-hard triplet loss achieves better performance on minority classes.
Comparative quantitative evaluation between retrieval and classification as an upper bound. Both retrieval and classification accuracies are comparable. Retrieval top-4 is superior to classification top-1.
Retrieval qualitative evaluation on three FGVR datasets: Flowers-102, Aircrafts, and Cars. Given a query image, the three nearest neighbors are depicted. The three consecutive rows show search results using the center loss, semi-hard, and hard triplet regularizers. Green and red outlines denote a match and a mismatch between the query and its result, respectively.
Detailed feature embedding quantitative analysis across the five datasets using the ResNet-50 architecture's penultimate layer. Triplet loss with hard mining achieves a superior embedding with ResNet-50 trained for 40K iterations. Center loss suffers from high instability. The vanilla method denotes softmax loss only.
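The retrieval numbers above can be approximated with a simple nearest-neighbor evaluation over the learned embeddings. The sketch below is an assumed setup, not the paper's exact protocol: it expects a model that returns (logits, embeddings) as in the earlier sketch, ranks a gallery by cosine similarity, and counts a hit when the top-k results contain the query's class.

```python
# Sketch of a retrieval evaluation: embed queries and a gallery with the
# trained network, then rank gallery images by cosine similarity and count
# a hit when the top-k set contains the query's class.
import torch

@torch.no_grad()
def retrieval_accuracy(model, query_loader, gallery_loader, k=4, device="cuda"):
    model.eval()

    def embed(loader):
        feats, labels = [], []
        for images, targets in loader:
            _, emb = model(images.to(device))    # embeddings are L2-normalized
            feats.append(emb.cpu())
            labels.append(targets)
        return torch.cat(feats), torch.cat(labels)

    q_feats, q_labels = embed(query_loader)
    g_feats, g_labels = embed(gallery_loader)

    sims = q_feats @ g_feats.t()                 # cosine similarity matrix (Q, G)
    topk = sims.topk(k, dim=1).indices           # indices of the k nearest gallery images
    hits = (g_labels[topk] == q_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()            # top-k retrieval accuracy
```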
  • The paper's code is available on GitHub. The GPU implementation is quite fast and increases training time only marginally.
  • The proposed idea is simple and surprisingly powerful. The paper provides extensive quantitative evaluations of the feature embedding (embedding space).
  • The main limitation of the paper is the lack of experiments on very large datasets. The largest dataset used is NABirds, which has 48,562 images. This is relatively small by current (2020) standards.
  • The paper claims that the two-head architecture boosts the performance of minority classes. However, it is evaluated on a single imbalanced dataset; evaluation on a variety of imbalanced datasets is needed to support this claim.
  • There is a large metric-learning (ranking loss) literature on better losses and sampling strategies that the paper omits. Only two variants of the triplet loss are used: hard and semi-hard mining (see the sketch after this list).
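For readers unfamiliar with the mining variants mentioned above, here is a minimal sketch of in-batch semi-hard triplet mining. It follows the common recipe (a negative is semi-hard when it is farther from the anchor than the positive, but still within the margin) and is not taken from the paper's code.

```python
# Sketch of in-batch semi-hard triplet mining (one common recipe; the
# paper's exact mining code may differ).
import torch

def semi_hard_triplet_loss(emb, labels, margin=0.2):
    dist = torch.cdist(emb, emb)                          # pairwise L2 distances (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # same-class mask (B, B)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)

    losses = []
    for a in range(len(labels)):
        pos_idx = torch.where(same[a] & ~eye[a])[0]       # positives: same class, not the anchor
        neg_idx = torch.where(~same[a])[0]                # negatives: different class
        if len(pos_idx) == 0 or len(neg_idx) == 0:
            continue
        for p in pos_idx:
            d_ap = dist[a, p]
            d_an = dist[a, neg_idx]
            semi = d_an[(d_an > d_ap) & (d_an < d_ap + margin)]
            # fall back to the hardest negative if no semi-hard one exists
            d_neg = semi.min() if len(semi) > 0 else d_an.min()
            losses.append(torch.clamp(d_ap - d_neg + margin, min=0))
    return torch.stack(losses).mean() if losses else emb.new_tensor(0.0)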

I write reviews of computer vision papers. Writing tips are welcome.
