Deep Metric Learning Beyond Binary Supervision

(a) Existing methods categorize neighbors into positive and negative classes and learn a metric space where positive images lie close to the anchor and negative ones lie far apart. In such a space, the distance between a pair of images is not necessarily related to their semantic similarity, since the order and degrees of similarity between them are disregarded. (b) To overcome this limitation, this paper [1] preserves the distance ratios of the label space in the learned metric space.
Triplet and Log-ratio losses. f denotes an embedding vector, y a continuous label, D(·,·) the squared Euclidean distance, m a margin, and [·]+ the hinge function. The embedding vectors are L2-normalized; otherwise their magnitudes diverge and the margin becomes trivial.
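For reference, the two losses can be written as below. This is my own transcription from the symbol definitions above, so the exact notation may differ slightly from the authors':

```latex
% Triplet loss for an anchor a, positive p, and negative n
\ell_{\mathrm{tri}}(a, p, n) =
  \Big[ \sqrt{D(f_a, f_p)} - \sqrt{D(f_a, f_n)} + m \Big]_{+}

% Log-ratio loss for an anchor a and two neighbors i, j with continuous labels y
\ell_{\mathrm{lr}}(a, i, j) =
  \left( \log \frac{D(f_a, f_i)}{D(f_a, f_j)}
       - \log \frac{D(y_a, y_i)}{D(y_a, y_j)} \right)^{2}
```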
Qualitative results of human pose retrieval.
Qualitative results of caption-aware image retrieval. Binary Tri.: L(Triplet)+M(Binary). ImgNet: ImageNet-pretrained ResNet101.
Performance versus embedding dimensionality.
  • The paper is well written and brings a novel perspective to the metric learning community.
  • The authors released their code on GitHub.
  • I am not sure why the authors use the log-ratio rather than the plain ratio. Maybe it is because the feature embedding is L2-normalized: the maximum Euclidean distance between two L2-normalized embeddings is 2, so the squared distance D(f_i, f_j) is bounded by 4. Accordingly, the ratio of label distances involving (y_i, y_j) would need to be normalized to a similar range (see the log-ratio sketch after this list).
  • In metric learning, a large embedding dimension is vital for achieving superior performance. I am not aware of any paper that extensively studies this relationship between performance and embedding dimension: (1) why does it arise — is it due to the L2 normalization? (2) how can it be mitigated? Accordingly, the performance-versus-embedding-dimension evaluation (last figure) is interesting, and I wish the authors had elaborated on it further.
  • The authors support the log-ratio loss with a dedicated mini-batch mining strategy, dense triplet mining (DTM). The log-ratio loss also works with random sampling, so DTM is justified only if random sampling is inferior. In Figure 3, the authors report the triplet loss performance with and without DTM, yet the log-ratio performance is never reported without DTM! Besides, DTM is not cheap: it loads the k nearest neighbors (kNNs) of every anchor, and finding these kNNs is not trivial for large datasets. An efficient nearest-neighbor index (e.g., a k-d tree or an approximate-kNN structure) is needed to retrieve them at reasonable cost (see the kNN-indexing sketch after this list).
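As referenced in the log-ratio bullet above, here is a minimal PyTorch sketch of the log-ratio loss, written directly from the symbol definitions in the caption. The function names, tensor shapes, and the eps stabilizer are my own assumptions, not the authors' released code:

```python
import torch

def squared_dist(a, b):
    # D(.,.): squared Euclidean distance, computed row-wise over the batch.
    return ((a - b) ** 2).sum(dim=-1)

def log_ratio_loss(f_a, f_i, f_j, y_a, y_i, y_j, eps=1e-8):
    # f_*: L2-normalized embeddings of the anchor and its two neighbors.
    # y_*: the corresponding continuous labels (e.g., pose vectors).
    log_dist_ratio = torch.log(squared_dist(f_a, f_i) + eps) \
                   - torch.log(squared_dist(f_a, f_j) + eps)
    log_label_ratio = torch.log(squared_dist(y_a, y_i) + eps) \
                    - torch.log(squared_dist(y_a, y_j) + eps)
    # Penalize the squared mismatch between the two log-ratios.
    return ((log_dist_ratio - log_label_ratio) ** 2).mean()

# Toy usage with random data; embeddings are L2-normalized as in the paper.
f = torch.nn.functional.normalize(torch.randn(3, 32, 128), dim=-1)
y = torch.rand(3, 32, 34)  # e.g., flattened 2D pose labels
loss = log_ratio_loss(f[0], f[1], f[2], y[0], y[1], y[2])
```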
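On the DTM cost point, here is a rough sketch of the kNN pre-computation it implies, using SciPy's k-d tree. The dataset size, label dimensionality, and k are placeholder values, not the authors' setup:

```python
import numpy as np
from scipy.spatial import cKDTree

labels = np.random.rand(100_000, 34)       # placeholder continuous labels
tree = cKDTree(labels)                      # build the index once
# Query k+1 neighbors because each point's nearest neighbor is itself.
_, knn_idx = tree.query(labels, k=64 + 1)
knn_idx = knn_idx[:, 1:]                    # per-anchor kNN indices for mining
```

In a high-dimensional label space an approximate index would likely be preferable, which is part of why this mining step is not free.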
