Metric learning learns a feature embedding that quantifies the similarity between objects and enables retrieval. Metric learning losses fall into two classes: pair-based and proxy-based. The next figure highlights the difference between the two. Pair-based losses pull similar samples together while pushing different samples apart (data-to-data relations). Proxy-based losses learn one or more class representatives (proxies) during training. Samples are then pulled towards their own class representatives and pushed away from the representatives of other classes (data-to-proxy relations).
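To make the two kinds of relations concrete, here is a minimal pure-Python sketch. It is an illustration, not the code of any particular paper: `contrastive_loss` is a simple contrastive-style pair loss on cosine similarity (the `margin` value is an assumption), and `proxy_nca_loss` follows the Proxy-NCA form with cosine similarity. The function names and the dictionary of proxies are hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(embeddings, labels, margin=0.5):
    """Pair-based: every pair of samples in the batch contributes a term."""
    total, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = cosine(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += 1.0 - sim               # pull similar samples together
            else:
                total += max(0.0, sim - margin)  # push different samples apart
            count += 1
    return total / count

def proxy_nca_loss(embeddings, labels, proxies):
    """Proxy-based: each sample is compared only with the class proxies."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        pos = math.exp(cosine(x, proxies[y]))  # similarity to its own proxy
        neg = sum(math.exp(cosine(x, p)) for c, p in proxies.items() if c != y)
        total += -math.log(pos / neg)
    return total / len(embeddings)
```

The pair loss loops over all pairs in the batch, so its cost grows quadratically with the batch size, whereas the proxy loss only touches one row of proxies per sample. In this form the proxy loss never compares two samples with each other.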
The next table summarizes the pros and cons of proxy-based and pair-based losses. Pair-based losses leverage fine-grained semantic relations between samples but converge slowly. In contrast, proxy-based losses converge faster but capture less of the semantic relations between samples. This happens because proxy-based losses can leverage only data-to-proxy relations, while pair-based losses leverage the richer data-to-data relations.
This paper presents a new proxy-based loss that takes advantage of both pair- and proxy-based methods. The proposed Proxy-Anchor loss allows data points in a training mini-batch to affect each other through their gradients. Thus, unlike vanilla proxy-based losses, the Proxy-Anchor loss utilizes data-to-data relations during training, as pair-based losses do. The next figure illustrates how the proposed Proxy-Anchor loss differs from Proxy-NCA (a typical proxy-based loss).
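For reference, the loss can be written as follows. The notation is the one I recall from the paper: $s(x, p)$ is the cosine similarity between an embedding $x$ and a proxy $p$, $P$ is the set of all proxies, $P^+$ is the set of proxies that have at least one positive sample in the batch, $X_p^+$ and $X_p^-$ are the positive and negative samples of proxy $p$, $\alpha$ is a scaling factor, and $\delta$ is a margin.

```latex
\ell(X) =
  \frac{1}{|P^+|} \sum_{p \in P^+}
    \log\Big(1 + \sum_{x \in X_p^+} e^{-\alpha\,(s(x,p) - \delta)}\Big)
  + \frac{1}{|P|} \sum_{p \in P}
    \log\Big(1 + \sum_{x \in X_p^-} e^{\alpha\,(s(x,p) + \delta)}\Big)
```

Each proxy acts as an anchor, and all samples associated with it sit inside the same log-sum-exp term. That shared term is what couples the samples of a batch: the gradient for one sample depends on the similarities of all the others.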
The paper summarizes the four main differences between Proxy-NCA and Proxy-Anchor as follows: (1) Gradients of Proxy-NCA loss with respect to positive examples have the same scale regardless of their hardness. (2) Proxy-Anchor loss dynamically determines gradient scales regarding relative hardness of all positive examples so as to pull harder positives more strongly. (3) In Proxy-NCA, each negative example is pushed only by a small number of proxies without considering the distribution of embedding vectors in fine details. (4) Proxy-Anchor loss considers the distribution of embedding vectors in more detail as all negative examples affect each other in their gradients.
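Points (1) and (2) can be checked numerically. The sketch below is a plain-Python illustration under stated assumptions, not the official implementation: it takes a precomputed similarity matrix instead of embeddings, and the function names are hypothetical. The defaults `alpha=32` and `delta=0.1` are the values I recall the paper using. `positive_gradient_weights` returns the magnitude of the derivative of the positive term with respect to each similarity, up to the constant factor alpha / |P+|.

```python
import math

def proxy_anchor_loss(sims, labels, alpha=32.0, delta=0.1):
    """sims[i][c] is the cosine similarity between sample i and the proxy of class c."""
    num_classes = len(sims[0])
    pos_term, neg_term = 0.0, 0.0
    classes_with_positives = 0
    for c in range(num_classes):
        # Each proxy is an anchor; all of its samples share one log-sum-exp term.
        pos = [math.exp(-alpha * (sims[i][c] - delta))
               for i, y in enumerate(labels) if y == c]
        neg = [math.exp(alpha * (sims[i][c] + delta))
               for i, y in enumerate(labels) if y != c]
        if pos:
            classes_with_positives += 1
            pos_term += math.log(1.0 + sum(pos))
        neg_term += math.log(1.0 + sum(neg))
    return pos_term / classes_with_positives + neg_term / num_classes

def positive_gradient_weights(pos_sims, alpha=32.0, delta=0.1):
    """Relative gradient magnitude for each positive sample of a single proxy."""
    h = [math.exp(-alpha * (s - delta)) for s in pos_sims]
    denom = 1.0 + sum(h)
    return [value / denom for value in h]

# One easy positive (similarity 0.9) and one hard positive (similarity 0.2).
easy_weight, hard_weight = positive_gradient_weights([0.9, 0.2])
```

Here the hard positive receives a far larger weight than the easy one, and each weight depends on the other sample through the shared denominator. In the Proxy-NCA form, the derivative with respect to the positive similarity is a constant, so every positive is pulled with the same strength.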
According to the paper's experiments, the Proxy-Anchor loss converges much faster than other metric learning losses.
The proxy-anchor loss eliminates the requirement for an efficient mini-batch sampling strategy. Thus, it is computationally cheaper during training. The inference cost is the same for all losses.
The next figure presents the quantitative evaluation of the Proxy-Anchor loss on two standard retrieval datasets: CUB-200-2011 and Stanford Cars-196. As reported in the paper, the Proxy-Anchor loss achieves state-of-the-art results on both.
- The official repository of the paper is available on GitHub. The code is well-organized and uses the PyTorch metric learning library.
- The paper is well-written and easy to read.
- The paper claims that the proxy-anchor loss is robust against noisy labels and outliers. Yet, this claim is neither supported nor rejected by any experiments in the paper.
- The paper assumes a single proxy (class representative) per class. It is a valid assumption on datasets with small intra-class variations. Yet, large intra-class variations break this assumption. This problem is clear when working with imbalanced datasets. Should a minority class have the same number of class representatives as a majority class?
Kim et al. Proxy Anchor Loss for Deep Metric Learning. CVPR 2020.
Movshovitz-Attias et al. No Fuss Distance Metric Learning Using Proxies. ICCV 2017.