This is the second part of a ranking-losses survey. The first part covered the contrastive and triplet losses. In this part, the N-pairs and angular losses are presented.
 N-Pairs Loss
Both contrastive and triplet losses leverage the Euclidean distance to quantify the similarity between points. In addition, every anchor in the training mini-batch is paired with a single negative example. The N-pairs loss changes these two assumptions. First, it uses the cosine similarity to quantify the similarity between points; thus, the N-pairs loss compares embeddings using the angle between two vectors, not their norms. This is a minor change, so for a single triplet (a, p, n) its formulation remains similar to the triplet loss, as follows
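The equation itself did not survive in this copy. For reference, a smooth single-triplet form consistent with the N-pairs paper (assuming embeddings f and dot-product similarity) is:

```latex
\mathcal{L}(a, p, n) = \log\!\left(1 + \exp\!\left(f_a^{\top} f_n - f_a^{\top} f_p\right)\right)
```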
However, the core idea of N-pairs loss is pairing every anchor with a single positive and every negative in the batch as follows
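The equation is missing here as well; with one positive p and N-1 negatives {n_i} per anchor a, the N-pairs formulation generalizes the single-triplet form to a softmax over all in-batch negatives:

```latex
\mathcal{L}_{\text{N-pair}}\!\left(a, p, \{n_i\}\right)
= \log\!\left(1 + \sum_{i=1}^{N-1} \exp\!\left(f_a^{\top} f_{n_i} - f_a^{\top} f_p\right)\right)
```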
For N-pairs, a training batch contains a single positive pair from each class. Thus, a mini-batch of size B will have B/2 positive pairs, and every anchor is paired with (B-2) negatives, as shown in the next figure.
The N-pairs intuition is to leverage all negatives within a batch to guide the gradient update, which speeds up convergence.
The N-pairs loss is generally superior to the triplet loss, but with a few caveats. The training mini-batch size is upper bounded by the number of training classes because only a single positive pair is allowed per class. In contrast, the mini-batch sizes of the triplet and contrastive losses are limited only by the GPU's memory. In addition, the N-pairs loss learns an un-normalized embedding. This has two consequences: (1) the margin between different classes is defined using an angle theta; (2) to avoid a degenerate embedding that grows to infinity, a regularizer is required to constrain the embedding space.
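To make the batch construction and the regularizer concrete, here is a minimal NumPy sketch of the N-pairs loss over a batch of anchor/positive pairs, with a squared-L2 penalty on the embedding norms. The function name and the `l2_reg` value are illustrative, not taken from a specific library:

```python
import numpy as np

def n_pairs_loss(anchors, positives, l2_reg=0.002):
    """N-pairs loss over a batch of anchor/positive pairs (one pair per class).

    anchors, positives: arrays of shape (N, D) with un-normalized embeddings.
    For anchor i, every positives[j] with j != i acts as one of the N-1
    negatives. An L2 penalty on the embedding norms keeps the un-normalized
    embedding from growing without bound.
    """
    # Similarity matrix: logits[i, j] = anchors[i] . positives[j]
    logits = anchors @ positives.T                       # (N, N)
    # Softmax cross-entropy where the correct "class" for row i is column i
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_probs))
    # Regularize squared embedding norms to constrain the embedding space
    reg = l2_reg * (np.mean(np.sum(anchors ** 2, axis=1))
                    + np.mean(np.sum(positives ** 2, axis=1)))
    return loss + reg
```

Note how the diagonal of the similarity matrix holds the anchor-positive scores while every off-diagonal entry is an anchor-negative score, which is exactly the "each anchor sees B-2 negatives" pairing described above.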
 Angular Loss
Angular loss tackles two limitations of the triplet loss. First, the triplet loss assumes a fixed margin m between different classes. A fixed margin is undesirable because different classes have different intra-class variations, as shown in the next figure.
The second limitation is how the triplet loss formulates the gradient at the negative point. The next figure shows why the direction of the negative's gradient may not be optimal, i.e., there is no guarantee that it moves away from the positive class's center.
To tackle both limitations, the authors propose to use the angle at n instead of the margin m and to correct the gradient at the negative point x_n. Instead of pushing points based on distance, the goal is to minimize the angle at n, i.e., make the triangle a-p-n pointy at n. The next figure illustrates how the angular loss formulation pushes the negative point x_n away from x_c, the center of the local cluster defined by x_a and x_p. In addition, the anchor x_a and the positive x_p are dragged towards each other.
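Concretely, with x_c denoting the midpoint of the anchor-positive pair and alpha the angle hyper-parameter replacing the margin m, the per-triplet hinge form of the angular loss can be written as:

```latex
x_c = \frac{x_a + x_p}{2}, \qquad
\ell_{\text{ang}} = \left[\, \lVert x_a - x_p \rVert^2
- 4\tan^2\!\alpha \,\lVert x_n - x_c \rVert^2 \,\right]_+
```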
Compared to the original triplet loss, whose gradients depend on only two points at a time (e.g., grad = x_a - x_n), the angular loss gradients are more robust because they consider all three points simultaneously. Also, note that compared to distance-based metrics, manipulating the angle at n is not only rotation-invariant but also scale-invariant by nature.
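The per-triplet computation is short enough to sketch directly. This assumes the hinge form with midpoint x_c and an illustrative default angle of 40 degrees (a value in the range the paper explores); the function itself is a minimal sketch, not a reference implementation:

```python
import numpy as np

def angular_loss(x_a, x_p, x_n, alpha_deg=40.0):
    """Angular loss for a single triplet (anchor, positive, negative).

    Penalizes the negative x_n for being too close to x_c, the midpoint of
    the anchor/positive pair, so the angle at x_n stays below alpha.
    """
    x_c = (x_a + x_p) / 2.0                     # center of the local cluster
    tan_sq = np.tan(np.radians(alpha_deg)) ** 2
    ap = np.sum((x_a - x_p) ** 2)               # squared anchor-positive gap
    nc = np.sum((x_n - x_c) ** 2)               # squared negative-center gap
    return max(0.0, ap - 4.0 * tan_sq * nc)
```

Because the loss depends on x_a, x_p, and x_n through both terms, the gradient at each point involves all three, which is the robustness property noted above.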
N-pairs and angular losses are generally superior to the vanilla triplet loss. However, there are important factors to consider when comparing these approaches.
- The sampling strategy used to train the triplet loss can lead to significant performance differences. Hard mining is efficient and converges faster, provided model collapse is avoided.
- The nature of the training dataset is another important factor. When working with person re-identification or face clustering, we can assume that each class is represented by a single cluster, i.e., a single mode with small intra-class variations. Yet, some retrieval datasets like CUB-200-2011 and Stanford Online Products have large intra-class variations. Empirically, the hard triplet loss works better for person/face re-identification tasks, while the N-pairs and angular losses work better on the CUB-200-2011 and Stanford Online Products datasets.
- When approaching a new retrieval task and tuning the hyper-parameters (learning rate and batch size) for a new training dataset, I found the semi-hard triplet loss to be the most stable. It does not achieve the best performance, but it is the least likely to degenerate.
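To make the semi-hard criterion concrete: a semi-hard negative is farther from the anchor than the positive, but still within the margin (d_ap < d_an < d_ap + margin). Here is a minimal, unoptimized NumPy sketch of selecting such triplets from a batch; the function name and margin value are illustrative:

```python
import numpy as np

def semi_hard_triplets(embeddings, labels, margin=0.2):
    """Select semi-hard triplets from a batch of embeddings.

    For each (anchor, positive) pair of the same class, pick a negative
    that satisfies d_ap < d_an < d_ap + margin: it does not yet collapse
    training (not "too hard"), but still violates the margin.
    Returns a list of (anchor_idx, positive_idx, negative_idx) tuples.
    """
    # Pairwise Euclidean distance matrix, shape (N, N)
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    triplets = []
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = dists[a, p]
            # Semi-hard negatives: different class, inside the margin band
            mask = ((labels != labels[a])
                    & (dists[a] > d_ap)
                    & (dists[a] < d_ap + margin))
            candidates = np.flatnonzero(mask)
            if len(candidates) > 0:
                triplets.append((a, p, int(candidates[0])))
    return triplets
```

Picking the first candidate keeps the sketch deterministic; in practice a random semi-hard candidate per pair is a common choice.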
If there are other ranking losses worth mentioning, please leave them in the comments. If the list gets big, I will write a follow-up.