No Fuss Distance Metric Learning using Proxies

Distance metric learning (DML) strives for an object embedding space consistent with semantic similarity: a space where objects from the same class lie closer together than objects from different classes. Beyond object classification, DML provides a similarity metric between objects, which enables clustering, retrieval, and nearest-neighbor applications. Triplet loss is a prominent DML loss used for face recognition/verification and person re-identification. Its sampling strategy identifies triplets (a, p, n) of anchors, positives, and negatives. The differentiable loss attracts positives toward their corresponding anchors while repelling negatives away.
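The margin-based triplet loss described above can be sketched in a few lines. This is a minimal numpy illustration (not the paper's code), using squared Euclidean distance and a hypothetical margin of 0.2:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embedding vectors.

    Pulls the positive toward the anchor and pushes the negative
    at least `margin` farther away (squared Euclidean distance).
    """
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)

# An easy triplet (negative already far away) suffers zero loss.
a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])
n = np.array([2.0, 0.0])
print(triplet_loss(a, p, n))  # → 0.0
```

Note that the easy triplet contributes no gradient at all, which is exactly why sampling matters.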

Given a labeled dataset of N objects, there exist O(N³) triplets. Most triplets are easy: they suffer zero loss and are useless during training, which hinders convergence. A better sampling strategy identifies hard or semi-hard triplets to speed convergence, but the cubic bound O(N³) poses a computational limitation for finding them. Even for a batch of 32 images, a sampling procedure must probe O(32³) triplets.
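To make the cubic cost concrete, here is a brute-force semi-hard miner over a batch (my own sketch, not the paper's procedure). It uses the common semi-hard criterion D(a,p) < D(a,n) < D(a,p) + margin and scans all O(B³) index combinations:

```python
import numpy as np
from itertools import permutations

def semi_hard_triplets(embeddings, labels, margin=0.2):
    """Enumerate semi-hard triplets in a batch: the negative lies farther
    than the positive but still inside the margin,
    D(a,p) < D(a,n) < D(a,p) + margin.
    Brute force over O(B^3) combinations, the cost the paper avoids."""
    triplets = []
    n = len(labels)
    for a, p in permutations(range(n), 2):
        if labels[a] != labels[p]:
            continue
        d_ap = np.sum((embeddings[a] - embeddings[p]) ** 2)
        for neg in range(n):
            if labels[neg] == labels[a]:
                continue
            d_an = np.sum((embeddings[a] - embeddings[neg]) ** 2)
            if d_ap < d_an < d_ap + margin:
                triplets.append((a, p, neg))
    return triplets

# Toy batch of four embeddings from two classes.
emb = np.array([[0, 0], [0.1, 0], [0.35, 0], [5, 5]], dtype=float)
labs = [0, 0, 1, 1]
print(semi_hard_triplets(emb, labs))
```

Even this tiny batch requires checking every (anchor, positive, negative) combination; at batch size 32 that is already tens of thousands of checks per step.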

This paper proposes an approach that relaxes the sampling complexity and speeds convergence. Instead of sampling positives and negatives from the N objects, they are sampled/assigned from P << N learned objects called proxies. Thus, a triplet becomes (a, P+(a), P-(a)), where P+(a) and P-(a) are the positive and negative proxies assigned to anchor a. The next figure presents three classes, in distinct colors. Each class has three samples, for a total of N=9 objects. Hard or semi-hard triplet sampling requires evaluating O(9³) triplets. Learning P << N proxies for the N objects relaxes the sampling complexity to O(P³).

N=9 objects, in circles, from three distinct classes. P=3 stars represent learned proxies. A sample anchor (pink circle outlined in cyan) is assigned to the positive proxy (pink star) and negative proxies (blue/green stars).

To learn the proxies, a proxy loss is proposed as follows

Proxy Loss

where D(x, y) is the distance between two objects and H is the Heaviside step function, which is substituted with a differentiable function to enable gradient-descent optimization. The paper suggests the margin-based triplet loss or Neighborhood Component Analysis (NCA) as surrogates for H, as follows

The margin-based triplet loss and Neighborhood Component Analysis surrogates for the H function.
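The NCA surrogate admits a very compact implementation. This is my own numpy sketch of the per-anchor Proxy-NCA term, -log( exp(-D(x, p(y))) / Σ_z exp(-D(x, z)) ), where the sum runs over the negative proxies only (as in the paper's NCA formulation) and D is the squared Euclidean distance:

```python
import numpy as np

def proxy_nca_loss(anchor, proxies, label):
    """Per-anchor Proxy-NCA term. `proxies` is the (P, dim) matrix of
    learned proxies, one per class here; `label` indexes the positive proxy.
    Note the complexity is O(P) per anchor: no triplet mining needed."""
    d = np.sum((proxies - anchor) ** 2, axis=1)  # distance to every proxy
    neg = np.delete(d, label)                    # distances to negative proxies
    # -log( exp(-d_pos) / sum_z exp(-d_neg) ) = d_pos + log(sum_z exp(-d_neg))
    return d[label] + np.log(np.sum(np.exp(-neg)))

# Toy example with P=3 proxies in 2-D: an anchor near its positive proxy
# incurs a much lower loss than one sitting next to a negative proxy.
P = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
print(proxy_nca_loss(np.array([0.1, 0.0]), P, 0))  # low
print(proxy_nca_loss(np.array([2.9, 0.0]), P, 0))  # high
```

Because the denominator ranges over all negative proxies rather than sampled negatives, every anchor contributes a gradient at every step, which is the source of the fast convergence reported below.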

The novel proxy loss is evaluated on three datasets: Cars196, Stanford Online Products, and CUB200. Both Recall@K and the Normalized Mutual Information (NMI) metric are used for quantitative evaluation. NMI is the ratio of the mutual information between the clustering and the ground truth to the harmonic mean of their entropies, as follows

Normalized Mutual Information equation

where H(·) is the entropy and I(A, B) is the mutual information between A and B. These slides have some NMI numerical examples. The convergence rate of the proxy loss is its main strength. The next figure shows its convergence rate compared to other DML methods.

Recall@1 as a function of training step on the Cars196 dataset. Proxy-NCA converges about three times faster than the baseline methods and achieves higher Recall@1 values.

The following tables show quantitative evaluations on the three datasets.

Retrieval and Clustering Performance on the Cars196 dataset. Bold indicates best results
Retrieval and Clustering Performance on the CUB200 dataset.
Recall@1 results on the Stanford Online Products dataset. Proxy-NCA has a six-percentage-point gap over the previous SOTA.

Qualitative evaluation, in the next figure, highlights retrieval performance on the Cars196 and Stanford Online Products datasets.

The paper outlines an extension to zero-shot learning that is omitted from this article. A PyTorch implementation is available at this GitHub link.

My Comments

  • Learning proxies to relax hard-mining complexity is the main contribution. It is a valuable and intuitive idea that speeds DML convergence and achieves state-of-the-art results.
  • Imbalanced datasets are disregarded in the paper (as in most DML papers, to be fair). On imbalanced data, learned proxies can be biased toward majority classes, e.g., a majority class gets multiple proxies while multiple minority classes are merged into a single proxy.
  • The proxy loss is similar to the center loss. Magnet loss, a generalization of the center loss, is mentioned in the related work, but the center loss is not. I wish the proxy loss had been compared to the center loss. Maybe the comparison is omitted because the center loss is used in conjunction with a softmax loss, i.e., for classification purposes.
  • I am not sure how to extend this approach to person re-identification or face identification/verification problems. There are no *classes* in these problems; there are many identities. The Market-1501 person re-identification dataset has 1,501 identities. Do we need to learn P=1,501 proxies? How does that extrapolate to unseen test identities?
  • From my previous experience, using the center loss for person re-identification learns an inferior embedding, so I am pessimistic about the proxy loss in similar problems.
  • Minor issue: I found the mathematical formulation in the paper a bit confusing and sometimes dispensable. I think the appendix and supplementary material are better places for details like the proxy loss bounds.
  • Minor technical issue: non-official dataset splits (train/test) are used in the experiments section. This makes comparative evaluation against other published results, like classification accuracies, challenging.

I write reviews on computer vision papers. Writing tips are welcomed.