FaceNet: A Unified Embedding for Face Recognition and Clustering

FaceNet is an embedding learning framework for face verification, recognition/classification, and clustering. The framework is evaluated on human faces by verifying whether two faces belong to the same person and by grouping faces that belong to the same person, as in Google Picasa. The paper's main contribution is the triplet loss. Different embedding networks, including Inception and AlexNet variants, are evaluated.

The main take-home message: triplet loss learns an embedding that works for both classification and clustering. The caveats, explicitly mentioned in the paper, are the requirement for big training batches, the long training duration, and, most importantly, the need for careful triplet selection. To speed up convergence, the triplets in a training batch should contain both hard positives and hard negatives.

A training triplet contains three exemplars (A,P,N): anchor, positive, and negative. The objective of any triplet loss embedding network is to learn an embedding F such that ||F(A)-F(P)||^2 + margin < ||F(A)-F(N)||^2, where the paper measures distance as squared Euclidean distance.

The embedding network's objective is to keep the (A,P) embeddings closer together than the (A,N) embeddings.
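The constraint above is usually turned into a hinge-style loss that is zero once the margin is respected. Here is a minimal sketch in plain Python, assuming squared Euclidean distances. The function names, the default margin of 0.2, and the toy 2-D vectors are illustrative choices for this post, not code from the paper.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once d(A,P) + margin <= d(A,N)."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)


# Toy 2-D embeddings (made-up values, only for illustration).
a = [0.0, 0.0]
p = [0.1, 0.0]
n_far = [1.0, 0.0]   # constraint already satisfied -> loss is 0
n_near = [0.2, 0.0]  # constraint violated -> positive loss

loss_far = triplet_loss(a, p, n_far)
loss_near = triplet_loss(a, p, n_near)
```

With these toy values, the far negative contributes nothing to training, while the near negative yields a positive loss that would push it away from the anchor.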

Triplet selection affects the network's convergence. For instance, if the training triplets already satisfy the embedding constraint, their loss is zero and the network learns nothing from them. This is the common case with random sampling. Thus, it is important to select triplets that violate the constraint to speed up learning.
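To make the point concrete, the sketch below filters a list of triplets and keeps only those that still violate the constraint, i.e. the ones that would produce a non-zero loss. The helper names `is_informative` and `select_informative`, the margin value, and the toy vectors are assumptions made for this illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def is_informative(anchor, positive, negative, margin=0.2):
    """True when the triplet still yields a non-zero triplet loss."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return d_ap + margin > d_an


def select_informative(triplets, margin=0.2):
    """Keep only the (A, P, N) triplets that violate the margin constraint."""
    return [t for t in triplets if is_informative(*t, margin=margin)]


# Toy (anchor, positive, negative) triplets with made-up 2-D embeddings.
triplets = [
    ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]),  # easy: constraint satisfied
    ([0.0, 0.0], [0.3, 0.0], [0.4, 0.0]),  # negative inside the margin
    ([0.0, 0.0], [0.5, 0.0], [0.2, 0.0]),  # negative closer than positive
]
kept = select_informative(triplets)
```

Only the last two triplets survive the filter. The first one is the kind of "easy" triplet that random sampling tends to produce and that wastes a training step.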

On a large dataset, selecting hard positives and negatives is computationally expensive. Thus, big batches are used, and all anchor-positive pairs in a “mini”-batch are utilized to avoid expensive hard-positive selection. Selecting the hardest negatives can in practice lead to bad local minima early in training; specifically, it can result in a collapsed model (i.e. f(x) = 0). To avoid that, the concept of semi-hard negatives is introduced. Instead of selecting hard negatives that are closer to the anchor than the positive exemplar, semi-hard negatives are selected: they are farther from the anchor than the positive but still lie within the margin.
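Semi-hard selection can be sketched as a simple filter over candidate negatives within a batch: keep those whose distance to the anchor falls between d(A,P) and d(A,P) + margin. The code below is a plain-Python illustration under that reading. The function name `semi_hard_negatives`, the margin of 0.2, and the toy embeddings are assumptions, and the paper does not spell out which semi-hard negative to pick when several qualify, so this sketch simply returns all of them.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def semi_hard_negatives(anchor, positive, candidates, margin=0.2):
    """Return indices of candidates with d(A,P) < d(A,N) < d(A,P) + margin."""
    d_ap = squared_distance(anchor, positive)
    picked = []
    for index, negative in enumerate(candidates):
        d_an = squared_distance(anchor, negative)
        if d_ap < d_an < d_ap + margin:
            picked.append(index)
    return picked


# Toy 2-D embeddings (made-up values). d(A,P) is about 0.09.
anchor = [0.0, 0.0]
positive = [0.3, 0.0]
candidates = [
    [0.2, 0.0],  # d(A,N) ~ 0.04: hard, closer than the positive
    [0.4, 0.0],  # d(A,N) ~ 0.16: semi-hard
    [0.5, 0.0],  # d(A,N) = 0.25: semi-hard
    [1.0, 0.0],  # d(A,N) = 1.00: easy, already beyond the margin
]
semi_hard = semi_hard_negatives(anchor, positive, candidates)
```

Here the hard negative at index 0 and the easy negative at index 3 are skipped, and only the two candidates inside the margin band are returned.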

These two tricks, big batches and semi-hard negative selection, improve the convergence of the embedding network.

While triplet loss is the paper's main focus, six embedding networks are evaluated. NN1 is a variation of AlexNet; the rest, NN2, …, NNS2, are Inception variants. The NNSX networks are small Inception models meant to run on mobile phones. They are computationally cheap in terms of memory and processing requirements but, of course, lag in accuracy.

Table 4 shows the performance of NN1 (the AlexNet variant) at different JPEG quality levels and image sizes. It illustrates the approach's robustness and graceful degradation at low image quality or on small thumbnails.

Table 5 shows the evaluation for different embedding dimensions. 128 dimensions perform best and are thus adopted. During training, a 128-dimensional float vector is used, but it can be quantized to 128 bytes without loss of accuracy.
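The exact quantization scheme is not described in this review, so the following is only one plausible sketch: assuming the embedding is L2-normalized (so every component lies in [-1, 1]), each float is mapped linearly to one unsigned byte. The function names and the toy vector are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def quantize(embedding):
    """Map each component from [-1, 1] to one unsigned byte (0..255)."""
    return bytes(min(255, max(0, round((x + 1.0) * 127.5))) for x in embedding)


def dequantize(code):
    """Approximate inverse of quantize: bytes back to floats in [-1, 1]."""
    return [b / 127.5 - 1.0 for b in code]


# Deterministic toy 128-D vector (made-up values, not a real face embedding).
raw = [math.sin(i) for i in range(128)]
embedding = l2_normalize(raw)

code = quantize(embedding)    # 128 bytes instead of 128 floats
restored = dequantize(code)
max_error = max(abs(x - y) for x, y in zip(embedding, restored))
```

With this uniform mapping the per-component rounding error is at most 1/255. Components of a unit-length 128-D vector are typically much smaller than 1 in magnitude, so a tighter range or a learned scale would likely use the byte budget better; that refinement is left out of the sketch.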

Tables 6 and 7 show impressive qualitative results.

My Comments:

  • I like the paper and enjoyed reading it.
  • In the paper, the semi-hard negative selection process is not fully illustrated. Does relaxing the constraint mean that random sampling will work?
  • The batch size used is 1800! If that's the number of triplets, this is big and might not be feasible on common GPUs. I am not sure how such a big batch would work for medium-size datasets.

I write reviews on computer vision papers. Writing tips are welcome.