Conditional Similarity Networks (CSN)

Ahmed Taha
Sep 17, 2018

Imagine the following question: given a dataset of fashion products, find the items most similar to a query item, e.g. green, female, short jeans. Without explicit similarity criteria, this question is ambiguous: the most similar items in terms of color, gender, style, or fabric are all different. If the dataset contains only a few items, a human can compromise and rank items according to an acceptable combination of similarities. This embedding and retrieval task is an important machine learning problem.

Most of the retrieval literature addresses embedding from the perspective of a single similarity metric. Given an explicit similarity metric, e.g. color, off-the-shelf approaches like contrastive loss or triplet loss can solve this problem. More advanced approaches, in terms of both complexity and accuracy, like quadruplet loss and quintuplet loss can be adapted as well; yet these systems are usually trained on a single metric. The usual way to support multiple similarity metrics is to train multiple systems. In the previous example, with four similarity metrics (color, gender, style, fabric), four different systems can be trained independently.
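
For reference, the triplet loss mentioned above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than any paper's implementation; the function names, the plain Euclidean distance, and the margin of 0.2 are assumptions chosen for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the anchor should be closer to the positive
    than to the negative by at least `margin` (an assumed value)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings (made-up numbers, for illustration only).
anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0]
print(triplet_loss(anchor, positive, negative))  # satisfied triplet -> 0.0
print(triplet_loss(anchor, negative, positive))  # roles swapped: the loss is positive
```

The loss only knows one notion of "closer", which is why a network trained this way is tied to a single similarity metric.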

This paper proposes a single network capable of handling multiple similarity metrics, i.e. there is no need to train multiple networks. The paper claims that a single network trained on multiple similarity metrics can outperform N specialized networks, each trained for a single similarity metric. The core idea is to train an initial embedding, then mask its features to obtain different embeddings, one per trained mask.

The following figure illustrates the proposed pipeline: a CNN learns the objects' initial embeddings, highlighted in orange. Multiple masks are applied to the initial embedding to obtain multiple embeddings, one per similarity metric. In the figure, the yellow, blue, and green masks filter color, gender, and style features respectively. Masks can be set manually or trained using standard deep network optimizers; both approaches are evaluated in the paper. In the figure, binary masks are used for simplicity.
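
The masking step can be sketched as an element-wise product between the initial embedding and one mask per similarity metric. The 6-dimensional embedding, the mask layout, and the name `apply_mask` below are hypothetical, chosen only to mirror the binary masks in the figure.

```python
def apply_mask(embedding, mask):
    """Element-wise product: keeps dimensions where mask is 1, zeroes the rest."""
    if len(embedding) != len(mask):
        raise ValueError("embedding and mask must have the same length")
    return [e * m for e, m in zip(embedding, mask)]

# Hypothetical 6-D initial embedding produced by the CNN for one image.
embedding = [0.9, -0.3, 0.5, 0.1, -0.7, 0.2]

# Hypothetical binary masks, one per similarity metric.
masks = {
    "color":  [1, 1, 0, 0, 0, 0],
    "gender": [0, 0, 1, 1, 0, 0],
    "style":  [0, 0, 0, 0, 1, 1],
}

# One initial embedding yields one masked embedding per metric.
for condition, mask in masks.items():
    print(condition, apply_mask(embedding, mask))
```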

Conditional Similarity Networks (CSN) Pipeline

Two datasets with multiple similarity metrics are leveraged to train and evaluate the proposed pipeline. The fonts dataset contains single characters in grayscale and defines similarity based on font style or character type. The second, the Zappos50k shoe dataset, defines similarity based on four metrics: the type of the shoes (i.e., shoes, boots, sandals or slippers), the suggested gender of the shoes (i.e., for women, men, girls or boys), the height of the shoes’ heels (numerical measurements from 0 to 5 inches) and the closing mechanism of the shoes (buckle, pull on, slip on, hook and loop or laced up). Datasets with multiple similarity metrics are rare, so these are valuable to those working on embedding and retrieval problems.

The following figure shows an overview of the whole architecture. The shirt and high-heel have the same color, while the high-heel and sneaker belong to the same fashion category.

The proposed Conditional Similarity Network consists of three key components: First, a learned convolutional neural network as a feature extractor that learns the disentangled embedding, i.e., different dimensions encode features for specific notions of similarity. Second, a condition that encodes according to which visual concept images should be compared. Third, a learned masking operation that, given the condition, selects the relevant embedding dimensions that induce a subspace which encodes the queried visual concept.
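
The interplay of the three components can be sketched as a triplet loss whose distance is computed only in the subspace selected by the condition. In this sketch the CNN is replaced by hand-written 4-D embeddings, the masks are fixed and binary, and all names and the margin are assumptions made for illustration.

```python
import math

def masked_distance(u, v, mask):
    """Euclidean distance computed only in the subspace selected by `mask`."""
    return math.sqrt(sum((m * (a - b)) ** 2 for a, b, m in zip(u, v, mask)))

def conditional_triplet_loss(anchor, positive, negative, condition, masks, margin=0.2):
    """Triplet loss where the condition picks which mask (subspace) to compare in."""
    mask = masks[condition]
    d_pos = masked_distance(anchor, positive, mask)
    d_neg = masked_distance(anchor, negative, mask)
    return max(0.0, d_pos - d_neg + margin)

# Made-up embeddings: dims 0-1 stand for "color", dims 2-3 for "category".
blue_sneaker = [1.0, 0.0, 0.0, 1.0]
blue_heel    = [1.0, 0.0, 1.0, 0.0]
red_sneaker  = [0.0, 1.0, 0.0, 1.0]

masks = {"color": [1, 1, 0, 0], "category": [0, 0, 1, 1]}

# Same three items, two different questions; one embedding satisfies both.
print(conditional_triplet_loss(blue_sneaker, blue_heel, red_sneaker, "color", masks))
print(conditional_triplet_loss(blue_sneaker, red_sneaker, blue_heel, "category", masks))
```

Both calls print a loss of 0.0 here because each question is answered in its own subspace, so the two triplets no longer compete for the same dimensions.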

The following figure shows 2D embeddings of shoes in two subspaces: closure mechanism and shoe category.

Visualization of 2D embeddings of subspaces learned by the CSN. The spaces are clearly organized according to (a) closure mechanism of the shoes and (b) the category of the shoes. This shows that CSNs can successfully separate the subspaces.

The proposed pipeline is evaluated against two baselines: a standard triplet network trained on multiple similarity metrics, and a set of N specialized triplet networks, each trained for a single similarity metric. Two variants of conditional similarity networks are proposed: fixed disjoint masks, where each dimension encodes a specific notion of similarity, and learned joint masks, where learnable masks select the features relevant to the respective notion of similarity. The following figure presents the evaluated models.
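
As a rough sketch of the fixed disjoint variant, the embedding can be split into equal blocks, one per similarity metric. The helper name, the contiguous block layout, and the dimensions below are assumptions; the learned variant would instead treat every mask entry as a trainable weight.

```python
def fixed_disjoint_masks(embedding_dim, num_conditions):
    """Split the embedding into equal contiguous blocks, one block per condition."""
    if embedding_dim % num_conditions != 0:
        raise ValueError("embedding_dim must be divisible by num_conditions")
    block = embedding_dim // num_conditions
    return [
        [1 if c * block <= i < (c + 1) * block else 0 for i in range(embedding_dim)]
        for c in range(num_conditions)
    ]

# Hypothetical sizes: an 8-D embedding shared by 4 similarity metrics.
masks = fixed_disjoint_masks(embedding_dim=8, num_conditions=4)
for m in masks:
    print(m)
```

Each dimension belongs to exactly one mask, so every metric gets its own 1/N slice of the embedding.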

The standard triplet network is trained with multiple different similarity metrics. For example, given (blue sneaker, blue heels, red sneaker, similarity_metric=color), the network should embed the blue sneaker closer to the blue heels, yet the same network should embed the blue sneaker closer to the red sneaker if similarity_metric=style. Given that the similarity_metric is not available at test time, this is a nonsensical baseline. I feel the need to highlight this explicitly multiple times in this article.
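
The conflict in this example can be made concrete. Assuming a hinge-style triplet loss with margin m (an assumption, as are the function names), the same three items appear with swapped positive and negative roles under the two metrics. Since max(0, x + m) + max(0, -x + m) is at least 2m for any x, no single unconditioned embedding can drive both losses to zero.

```python
def hinge(d_pos, d_neg, margin):
    """Hinge-style triplet loss on precomputed distances."""
    return max(0.0, d_pos - d_neg + margin)

def conflicting_pair_loss(d_heel, d_red, margin=0.2):
    """Total loss for two triplets over the same items with swapped roles.

    d_heel: distance from the blue sneaker to the blue heels
    d_red:  distance from the blue sneaker to the red sneaker
    """
    color_loss = hinge(d_heel, d_red, margin)  # color: the heels are the positive
    style_loss = hinge(d_red, d_heel, margin)  # style: the red sneaker is the positive
    return color_loss + style_loss

# Whatever distances the embedding produces, the total stays at or above 2 * margin.
for d_heel, d_red in [(0.1, 1.0), (1.0, 0.1), (0.5, 0.5), (0.45, 0.55)]:
    print(d_heel, d_red, conflicting_pair_loss(d_heel, d_red))
```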

These four networks are quantitatively evaluated on the Zappos50k shoe dataset. The standard triplet network is 25% better than random. It probably achieves this accuracy thanks to the scarcity of overlapping triplets, i.e. pairs (O1, O2, O3, similarity_metric=i) and (O1, O2, O3, similarity_metric=j) where i ≠ j.

The N specialized triplet networks did much better, as expected. Yet the requirement to train N different networks without sharing weights is an unappealing consequence. I find it quite interesting that the CSN variants beat the N different networks by a small margin. The solution's main strength is its smaller number of trainable parameters. The following figure visualizes the masks for the standard triplet network, fixed disjoint mask, and learned joint mask CSN variants.

Visualization of the masks: Left: In standard triplet networks, each dimension is equally taken into account for each triplet. Center: The Conditional Similarity Network allows to focus on a subset of the embedding to answer a triplet question. Here, each mask focuses on one fourth. Right: For learned masks, it is evident that the model learns to switch off different dimensions per question. Further, a small subset is shared across tasks.

The very sparse learned masks are reported to be “interesting” and to confirm that the concepts are low-dimensional. An additional experiment evaluates the models' robustness across varying numbers of training triplets to highlight the benefits of joint learning. The figure below shows the results. Again, the benefits are marginal; I acknowledge that CSNs achieve comparable performance with a much smaller number of trainable parameters. Yet such a small accuracy margin on a single dataset raises doubts about the claim that joint learning yields better accuracy.

Triplet prediction performance with respect to the number of unique training triplets available. CSNs with fixed masks consistently outperform the set of specialized triplet networks. CSNs with learned masks generally require more triplets, since they need to learn the embedding as well as the masks. However, when enough triplets are available, they provide the best performance.

My Comments:

  • While I doubt a couple of findings reported in this paper, it is very well written. I recommend reading it.
  • The quantitative results highlight the superiority of the CSN with learnable masks over the N specialized triplet networks. Yet factors like (1) the small margin, (2) results reported on a single dataset, and (3) an early-stopping training procedure, where the snapshot achieving the highest validation performance is used on the test set, raise my doubts regarding the better-accuracy claim.
  • The standard triplet network baseline is nonsensical. The N specialized networks ought to be the baseline. Given the small margin between the N specialized networks and both CSN variants, I would have trained the N specialized networks with masks as well. The extra learnable mask weights could explain such a performance gap. Conversely, the fixed disjoint masks could encourage a more focused embedding, in the 1/N part, and thus reduce the chance of overfitting. These competing factors, more learnable weights vs. a more focused embedding via fixed disjoint masks, are interesting to evaluate. At the very least, it would provide some insight into why the fixed disjoint masks perform better in general, and especially when training with a smaller number of triplets. To my knowledge, such an explanation is omitted from the paper.
  • The authors' “interesting” finding regarding the sparse masks is surprising to me because their loss function includes an L1 regularizer to encourage sparse masks. Given that L1 is an aggressive sparsity-inducing regularizer, the learned sparse masks are more expected than interesting.
  • The authors deserve a big round of applause for publishing their implementation online. I plan to use it for some of my experiments.
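
To illustrate the comment on the L1 regularizer above: an L1 penalty tends to push small weights to exactly zero, which is why sparse masks are the expected outcome. The sketch below uses soft-thresholding, the proximal step associated with an L1 penalty, purely as an illustration; it is not the paper's training procedure, and the mask values and the strength `lam` are made up.

```python
def l1_penalty(mask, lam):
    """L1 regularization term added to the loss (lam is an assumed strength)."""
    return lam * sum(abs(m) for m in mask)

def soft_threshold(value, lam):
    """Proximal step for the penalty lam * |value|: shrink toward zero, clip at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

# Hypothetical learned mask weights before the shrinkage step.
mask = [0.90, 0.05, -0.02, 0.40, 0.01, 0.00]
shrunk = [soft_threshold(m, lam=0.1) for m in mask]

print(shrunk)  # weights smaller than lam in magnitude become exactly zero
print(sum(1 for m in shrunk if m == 0.0), "of", len(shrunk), "dimensions switched off")
print(l1_penalty(mask, lam=0.1), ">", l1_penalty(shrunk, lam=0.1))
```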