Proxy Anchor Loss for Deep Metric Learning

Aug 23, 2020

Metric learning learns a feature embedding in which distances quantify the similarity between objects, which in turn enables retrieval. Metric learning losses fall into two classes: pair-based and proxy-based. The next figure highlights the difference between the two. Pair-based losses pull similar samples together while pushing different samples apart (data-to-data relations). Proxy-based losses learn class representative(s), called proxies, during training. Samples are then pulled towards their own class representatives and pushed away from the representatives of other classes (data-to-proxy relations).

Proxy-based losses learn class representatives (stars) for each class. Data samples (circles) are pulled towards their own class representative and pushed away from the other representatives (data-to-proxy). Pair-based losses pull similar data samples together and push different data samples apart (data-to-data). The solid green lines indicate a pull force, while the dashed red lines indicate a push force.

The next table summarizes the pros and cons of both proxy-based and pair-based losses. For instance, pair-based losses leverage fine-grained semantic relations between samples but suffer from slow convergence. In contrast, proxy-based losses converge faster but capture poorer semantic relations between samples. This happens because proxy-based losses can leverage only data-to-proxy relations, while pair-based losses leverage the rich data-to-data relations.

This paper [1] presents a new proxy-based loss that takes advantage of both pair- and proxy-based methods. The proposed Proxy-Anchor loss allows the data points in a training mini-batch to affect each other through their gradients. Thus, unlike vanilla proxy-based losses, the proxy-anchor loss utilizes data-to-data relations during training, like pair-based losses do. The next figure illustrates how the proposed proxy-anchor loss differs from Proxy-NCA [2] (a typical proxy-based loss).
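For reference, here is the Proxy-Anchor loss as I read it from the paper [1]. None of the symbols appear elsewhere in this post, so treat the notation as my transcription: s(x, p) is the cosine similarity between embedding x and proxy p, α is a scaling factor, δ is a margin, P is the set of all proxies, P⁺ is the set of proxies that have at least one positive sample in the mini-batch, and X_p⁺ / X_p⁻ are the positive / negative samples of proxy p in the mini-batch X.

```latex
\ell(X) = \frac{1}{|P^{+}|} \sum_{p \in P^{+}}
            \log\Bigl(1 + \sum_{x \in X_p^{+}} e^{-\alpha\,(s(x,p) - \delta)}\Bigr)
        + \frac{1}{|P|} \sum_{p \in P}
            \log\Bigl(1 + \sum_{x \in X_p^{-}} e^{\alpha\,(s(x,p) + \delta)}\Bigr)
```

Each proxy acts as an anchor and aggregates over all samples in the mini-batch inside the logarithm. This is why the gradient for one sample depends on the other samples associated with the same proxy.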

Differences between Proxy-NCA and Proxy-Anchor in handling proxies and embedding vectors during training. Each proxy is colored in black and three different colors indicate distinct classes. The associations defined by the losses are expressed by edges, and thicker edges get larger gradients.

The four main differences between Proxy-NCA and Proxy-Anchor are summarized in the paper as follows: (1) Gradients of Proxy-NCA loss with respect to positive examples have the same scale regardless of their hardness. (2) Proxy-Anchor loss dynamically determines gradient scales regarding relative hardness of all positive examples so as to pull harder positives more strongly. (3) In Proxy-NCA, each negative example is pushed only by a small number of proxies without considering the distribution of embedding vectors in fine details. (4) Proxy-Anchor loss considers the distribution of embedding vectors in more detail as all negative examples affect each other in their gradients.
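To make differences (1) and (2) concrete, here is a minimal pure-Python sketch of the Proxy-Anchor loss and of the pull strength each positive sample receives from its proxy. It is an illustration under my own assumptions, not the official implementation: the function names (cosine, proxy_anchor_loss, positive_pull_weights) are hypothetical, and the defaults alpha=32 and delta=0.1 are the values I believe the paper uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy-Anchor loss for one mini-batch; proxies[c] is the proxy of class c."""
    pos_terms = []  # one term per proxy with at least one positive in the batch
    neg_terms = []  # one term per proxy
    for c, proxy in enumerate(proxies):
        sims = [cosine(x, proxy) for x in embeddings]
        pos = [math.exp(-alpha * (s - delta)) for s, y in zip(sims, labels) if y == c]
        neg = [math.exp(alpha * (s + delta)) for s, y in zip(sims, labels) if y != c]
        if pos:
            pos_terms.append(math.log1p(sum(pos)))
        neg_terms.append(math.log1p(sum(neg)))
    pos_part = sum(pos_terms) / len(pos_terms) if pos_terms else 0.0
    return pos_part + sum(neg_terms) / len(proxies)


def positive_pull_weights(sims, alpha=32.0, delta=0.1):
    """Magnitude of d(loss)/d(s(x, p)) for each positive x of a single proxy p.

    sims holds the similarities s(x, p) of all positives of p in the batch.
    The constant factor 1/|P+| is omitted because it is the same for all x.
    In Proxy-NCA the corresponding magnitude is 1 for every positive.
    """
    h = [math.exp(-alpha * (s - delta)) for s in sims]
    denom = 1.0 + sum(h)
    return [alpha * v / denom for v in h]


# Hypothetical toy batch: two classes, 2-D embeddings.
proxies = [[1.0, 0.0], [0.0, 1.0]]
embeddings = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
labels = [0, 0, 1]
loss = proxy_anchor_loss(embeddings, labels, proxies)

# An easy positive (s = 0.9) and a hard positive (s = 0.3) of the same proxy.
weights = positive_pull_weights([0.9, 0.3])
```

The hard positive ends up with a far larger weight than the easy one, because every weight shares a denominator that sums over all positives of the proxy. This shared denominator is how samples in the mini-batch influence each other's gradients.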

The proxy-anchor loss converges much faster than the other metric learning losses it is compared against.

Accuracy in Recall@1 versus training time on the Cars-196 dataset. Note that all methods were trained with a batch size of 150 on a single Titan Xp GPU. Proxy-anchor loss achieves the highest accuracy and converges faster than the baselines in terms of both the number of epochs and the actual training time.

The proxy-anchor loss eliminates the need for an efficient mini-batch sampling strategy, so it is computationally cheaper during training. The inference cost is the same for all losses, since the loss is used only during training.
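A rough way to see why pair-based losses lean on sampling while proxy-based losses do not is to count how many relations each family could examine. The sketch below is only a back-of-the-envelope count; the function name count_relations and the dataset size used in the example (8,000 samples, 100 classes) are hypothetical and not taken from the paper.

```python
from math import comb


def count_relations(num_samples, num_classes):
    """Count candidate relations for a dataset with one proxy per class."""
    return {
        # Unordered sample pairs, as used by contrastive-style losses.
        "pairs": comb(num_samples, 2),
        # Ordered (anchor, positive, negative) triples: an upper bound that
        # ignores label constraints.
        "triplets_upper_bound": num_samples * (num_samples - 1) * (num_samples - 2),
        # Every sample against every proxy.
        "data_to_proxy": num_samples * num_classes,
    }


relations = count_relations(num_samples=8000, num_classes=100)
```

With far fewer classes than samples, the number of data-to-proxy relations stays well below the number of pairs or triplets, so no tuple sampling scheme is needed to keep training tractable.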

The next figure presents a quantitative evaluation of the proxy-anchor loss on two standard retrieval datasets: CUB-200-2011 and Cars-196. The proxy-anchor loss achieved state-of-the-art results at the time of publication.

Recall@K (%) on the CUB-200-2011 and Cars-196 datasets. Superscripts denote embedding sizes and † indicates models using larger input images. Backbone networks of the models are denoted by abbreviations: G–GoogleNet, BN–Inception with batch normalization, R50–ResNet50.

My Comments

The official repository of the paper is available on GitHub. The code is well-organized and uses the PyTorch Metric Learning library.
The paper is well-written and easy to read.
The paper claims that the proxy-anchor loss is robust against noisy labels and outliers. Yet, this claim is neither supported nor rejected by any experiments in the paper.
The paper assumes a single proxy (class representative) per class. It is a valid assumption on datasets with small intra-class variations. Yet, large intra-class variations break this assumption. This problem is clear when working with imbalanced datasets. Should a minority class have the same number of class representatives as a majority class?

References

[1] Proxy Anchor Loss for Deep Metric Learning.

[2] No-fuss distance metric learning using proxies.

Written by Ahmed Taha