Imagine the following question: given a dataset of fashion products, find the items most similar to a query fashion item, e.g. green, short, women's jeans. Without explicit similarity criteria, this question is ambiguous: the most similar items in terms of color, gender, style, or fabric are different. If the dataset contains only a few items, a human can compromise and rank items according to an acceptable combination of similarities. This embedding-and-retrieval problem is an important machine learning problem.
Most retrieval literature addresses embedding from a single-similarity-metric perspective. Given an explicit similarity metric, e.g. color, off-the-shelf machine learning approaches like contrastive loss or triplet loss can solve this problem. More advanced approaches, in terms of both complexity and accuracy, like quadruplet loss and quintuplet loss, can be adapted as well; yet these systems are usually trained on a single metric. To support multiple similarity metrics, multiple systems must be trained. In the previous example, with four similarity metrics (color, gender, style, fabric), four different systems would be trained independently.
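To make the single-metric setting concrete, here is a minimal sketch of a triplet loss, written in NumPy for simplicity (the margin value and toy vectors are illustrative, not taken from the paper):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet hinge loss for one similarity metric:
    the anchor should be at least `margin` closer (in Euclidean
    distance) to the positive than to the negative."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Toy 2-D embeddings: the positive coincides with the anchor,
# so the hinge is inactive and the loss is zero.
a = np.array([[1.0, 0.0]])
loss = triplet_loss(a, a, np.array([[0.0, 1.0]]))
```

A network trained with this loss encodes exactly one notion of similarity; supporting four metrics this way means four such networks.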
This paper proposes a single network capable of handling multiple similarity metrics, i.e. there is no need to train multiple networks. The paper claims that a single network trained on multiple similarity metrics can outperform N specialized networks, each trained for a single similarity metric. The core idea is to train an initial embedding, then mask its features to obtain different embeddings depending on the trained masks.
The following figure illustrates the proposed pipeline; a CNN learns objects' initial embeddings, highlighted in orange. Multiple masks are applied to the initial embedding to obtain multiple embeddings, one per similarity metric. In the figure, the yellow, blue, and green masks filter color, gender, and style features respectively. Masks can be manually fixed or trained using standard deep network optimizers; both approaches are evaluated in the paper. In the figure, binary masks are used for simplicity.
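The masking step can be sketched in a few lines. This is a toy illustration with fixed disjoint binary masks, where each similarity notion owns a slice of the dimensions; the dimensionality, mask layout, and metric names are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

# Hypothetical 9-D initial embedding of one item (the CNN output).
embedding = np.array([0.9, 0.1, 0.4, 0.7, 0.2, 0.8, 0.3, 0.5, 0.6])

# Fixed disjoint binary masks: each metric gets its own 3 dims.
masks = {
    "color":  np.array([1, 1, 1, 0, 0, 0, 0, 0, 0]),
    "gender": np.array([0, 0, 0, 1, 1, 1, 0, 0, 0]),
    "style":  np.array([0, 0, 0, 0, 0, 0, 1, 1, 1]),
}

# One masked embedding per similarity metric, all derived from
# the same shared representation.
conditional = {name: embedding * m for name, m in masks.items()}
```

In the learned variant, the mask entries would be real-valued parameters optimized jointly with the CNN instead of hand-set zeros and ones.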
Two datasets with multiple similarity metrics are leveraged to train and evaluate the proposed pipeline. The fonts dataset contains single characters in grayscale and defines similarity based on font style or character type. The second, the Zappos50k shoe dataset, defines similarity based on four metrics: the type of the shoes (shoes, boots, sandals, or slippers), the suggested gender (women, men, girls, or boys), the height of the shoes' heels (numerical measurements from 0 to 5 inches), and the closing mechanism (buckle, pull on, slip on, hook and loop, or laced up). Datasets with multiple similarity metrics are rare, so these are valuable to anyone working on embedding and retrieval problems.
The following figure shows an overview of the whole architecture. The shirt and high-heel have the same color, while the high-heel and sneaker belong to the same fashion category.
The following figure shows 2D embedding for shoes in two spaces: closure mechanism and shoes’ category.
The proposed pipeline is evaluated against a standard triplet network trained on multiple similarity metrics and a set of N specialized triplet networks, each trained for a single similarity metric. Two variants of conditional similarity networks (CSNs) are proposed: fixed disjoint masks, where each dimension encodes a specific notion of similarity, and learned joint masks, where learnable masks select the features relevant to the respective notion of similarity. The following figure presents the evaluated models.
The standard triplet network is trained with multiple different similarity metrics. Given (blue sneaker, blue heel, red sneaker, similarity_metric=color), the network should embed the blue sneaker closer to the blue heel; yet the same network should embed the blue sneaker closer to the red sneaker if similarity_metric=style. Given that the similarity_metric is not available at test time, this is a nonsense baseline. I feel the need to highlight this explicitly multiple times in this article.
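For contrast, here is a minimal sketch of how a mask-conditioned triplet objective resolves this contradiction: the similarity metric selects a mask, and the hinge is computed only in the masked subspace. The 4-D embeddings, mask layout, and margin below are illustrative assumptions:

```python
import numpy as np

def masked_triplet_loss(anchor, pos, neg, mask, margin=0.2):
    # Distances are computed only in the subspace the mask selects,
    # so one shared embedding can satisfy triplets that contradict
    # each other under different similarity metrics.
    d_pos = np.linalg.norm((anchor - pos) * mask)
    d_neg = np.linalg.norm((anchor - neg) * mask)
    return max(d_pos - d_neg + margin, 0.0)

# Toy 4-D embeddings: dims 0-1 encode color, dims 2-3 encode style.
blue_sneaker = np.array([1.0, 0.0, 1.0, 0.0])
blue_heel    = np.array([1.0, 0.0, 0.0, 1.0])
red_sneaker  = np.array([0.0, 1.0, 1.0, 0.0])

color_mask = np.array([1.0, 1.0, 0.0, 0.0])
style_mask = np.array([0.0, 0.0, 1.0, 1.0])

# Under color the blue heel is the positive; under style the red
# sneaker is. Both triplets incur zero loss simultaneously.
loss_color = masked_triplet_loss(blue_sneaker, blue_heel, red_sneaker, color_mask)
loss_style = masked_triplet_loss(blue_sneaker, red_sneaker, blue_heel, style_mask)
```

An unmasked network receives both triplets as contradictory constraints on the same distance, which is exactly why the standard baseline struggles.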
These four networks are quantitatively evaluated on the Zappos50k shoe dataset. The standard triplet network beats random by 25%. It probably achieves this accuracy because overlapping triplets, i.e. (O1, O2, O3, similarity_metric=i) and (O1, O2, O3, similarity_metric=j) where i ≠ j, are scarce.
The N specialized triplet networks did much better, as expected. Yet the requirement to train N different networks, without sharing weights, is a repelling consequence. I find it quite interesting that the CSN variants beat the N different networks by a small margin. The solution's main quality is its smaller number of trainable parameters. The following figure visualizes the learned masks for the standard triplet network and for the fixed disjoint mask and learned joint mask CSN variants.
The very sparse learned masks are reported to be "interesting" and to confirm that the concepts are low-dimensional. An additional experiment evaluates the models' robustness across various numbers of training triplets to highlight the benefits of joint learning. The figure below shows the results. Again, the benefits are marginal; I acknowledge the value of achieving comparable performance with a much smaller number of trainable parameters. Yet such a small accuracy margin on a single dataset raises doubts about the claim that joint learning yields better accuracy.
- While I doubt a couple of the findings reported in this paper, it is very well written. I recommend reading it.
- The quantitative results highlight the superiority of the CSN with learnable masks over the N specialized triplet networks. Yet factors like (1) the small margin, (2) results reported on a single dataset, and (3) an early-stopping training procedure where the snapshot achieving the highest validation performance is used on the test set, raise my doubts regarding the better-accuracy claim.
- The standard triplet network baseline is nonsense. The N specialized networks ought to be the baseline. Given the small margin between the N specialized networks and both CSN variants, I would have trained the N specialized networks with masks as well. The extra learnable mask weights could explain such a performance gap. Conversely, the fixed disjoint masks could encourage a more focused embedding, within each 1/n part, and thus reduce the chances of overfitting. These contradicting factors, more learnable weights versus a more focused embedding via fixed disjoint masks, would be interesting to evaluate. At least it would provide some insight into why the fixed disjoint masks perform better in general, and especially when training with a smaller number of triplets. To my knowledge, such an explanation is omitted in the paper.
- The authors' "interesting" finding regarding the sparse masks is surprising to me because their loss function includes an L1 regularizer to encourage sparse masks. Given that L1 is an aggressive sparsity regularizer, the learned sparse masks are more expected than interesting.
- The authors deserve a big round of applause for publishing their implementation online. I plan to use it for some of my experiments.
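The sparsity point above can be made concrete. The mechanism is clearest in the proximal (soft-thresholding) view of L1 regularization: small entries are set to exactly zero at each step. CSN trains with plain gradient descent rather than proximal steps, and the threshold value below is illustrative, but the constant pull toward zero is the same, so sparse learned masks are the expected outcome:

```python
import numpy as np

def l1_proximal_step(mask, lr=0.1, l1_weight=0.5):
    """One soft-thresholding step for an L1 penalty: entries with
    magnitude below lr * l1_weight are set exactly to zero, which
    is why L1-regularized masks come out sparse."""
    thresh = lr * l1_weight  # 0.05 with the defaults above
    return np.sign(mask) * np.maximum(np.abs(mask) - thresh, 0.0)

m = np.array([0.9, 0.03, -0.4, 0.01])
out = l1_proximal_step(m)  # the 0.03 and 0.01 entries are zeroed
```

An L2 penalty, by contrast, only shrinks entries toward zero without ever reaching it, which is why the choice of L1 here essentially builds the "interesting" sparsity into the objective.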