# No Fuss Distance Metric Learning using Proxies

Distance metric learning (DML) strives for an object embedding space consistent with a semantic similarity; a space where objects from the same class are closer than objects from different classes. Beyond object classification, DML provides a similarity metric between objects. This enables clustering, retrieval, and nearest neighbor applications. Triplet loss is a prominent DML loss used for face-recognition/verification and person re-identification. Its sampling strategy identifies triplets (a,p,n) of anchors, positives and negatives. The differentiable loss attracts positives to their corresponding anchors while repelling negatives away.

Given a labeled dataset of N objects, there exist O(N³) triplets. Most triplets are easy, suffer zero loss, and useless during training. This hinders training convergence. A superior sampling strategy identifies hard or semi-hard triplets to speed convergence. The cubic bound O(N³) poses a computational limitation for identifying these triplets. For a batch of 32 images, a sampling procedure is required to prob O(32³) triplets.

This paper proposes an approach to relax the sampling procedure complexity and speed convergence. Instead of sampling the positives and negatives from N objects, they are sampled/assigned from P << N objects called proxies. Thus, a triplet becomes (a, P+(a),P-(a)). The next figure presents three classes, in distinct colors. Each class has three samples; thus total N=9 objects. Hard or semi-hard triplet loss sampling requires evaluating O(9³) triplets. Learning P << N proxies for the N objects relaxes the sampling complexity to be O(P³).

To learn the proxies, a proxy loss is proposed as follows

where D(x,y) is the distance between two objects. H is the Heaviside step function, substituted by a differentiable function to enable gradient descent optimization. The paper suggests margin-based triplet loss or Neighborhood Component Analysis (NCA) for H as follows

The novel proxy loss is evaluated on three datasets: Cars196, Stanford Online Products, and CUB200. Both retrieval @ K and Normalized Mutual Information measure (NMI) metrics are used for quantitative evaluation. NMI is the ratio of the mutual information of the clustering and ground truth, to their harmonic mean as follows

Where H(.) is the entropy, I(A, B) is the mutual information between A and B. These slides have some NMI numerical examples. The convergence rate of the proxy loss is its main strength. The next figure shows its convergence rate compared to other DML methods.

The following tables show quantitative evaluations on the three datasets.

Qualitative evaluation, in the next figure, highlights retrieval performance on Cars and Stanford outline products datasets

The paper outlines an extension for zero-shot learning that is omitted in this article. A **pytorch** **implementation** is available in this Github link.

**My Comments**

- Learning proxies to relax hard mining complexity is the main contribution. It is a valuable and intuitive idea. It speeds DML convergence and achieves state-of-the-art results.
- Imbalance datasets are disregarded in the paper, in most DML papers to be fair. Learned proxies, for imbalance data, can be biased towards majority classes, i.e. A majority class gets multiple proxies while multiple minority classes get merged into a single proxy.
- The proxy loss is similar to the center loss. Magnet loss, a generalization of center loss, is mentioned in the related work but not center loss. I wish the proxy loss was compared to the center loss. Maybe it is omitted because center loss is used in conjunction with softmax loss, i.e. for classification purpose.
- I am not sure how to extend this approach to person reidentification or face identification/verification problems. There are no *classes* in these problems, there are a lot of identities, the market person-reidentification dataset has 1501 identifies. Do we need to learn P=1501 person identities? how that extrapolates to test/unseen person identifies?
- From my previous experience, using center loss for person reidentification learns an inferior embedding. So I am pessimistic about the proxy loss in similar problems.
- Minor issue: I found the mathematical formulation in the paper a bit confusing and sometimes dispensable. I think appendix and supplementary material are better places to emphasize details like proxy loss bounds.
- Minor technical issue: None official datasets splits (train/test) are utilized in the experiment sections. This makes comparative evaluation against other published results, like classification accuracies, challenging.