Conditional Similarity Networks (CSN)

Conditional Similarity Networks (CSN) Pipeline
The proposed Conditional Similarity Network consists of three key components: First, a learned convolutional neural network as a feature extractor that learns the disentangled embedding, i.e., different dimensions encode features for specific notions of similarity. Second, a condition that encodes according to which visual concept images should be compared. Third, a learned masking operation that, given the condition, selects the relevant embedding dimensions that induce a subspace which encodes the queried visual concept.
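To make the masking idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the toy 4-D embeddings, the two hand-written masks, the margin of 0.2, and the function names `masked_distance` and `conditional_triplet_loss` are all assumptions for illustration. The same triplet is judged under two conditions, and the mask decides which dimensions count.

```python
import math

def masked_distance(u, v, mask):
    """Euclidean distance between u and v after element-wise masking."""
    return math.sqrt(sum((m * (a - b)) ** 2 for a, b, m in zip(u, v, mask)))

def conditional_triplet_loss(anchor, positive, negative, mask, margin=0.2):
    """Hinge triplet loss computed only in the subspace selected by the mask."""
    d_pos = masked_distance(anchor, positive, mask)
    d_neg = masked_distance(anchor, negative, mask)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical embeddings: dims 0-1 encode "category", dims 2-3 encode "closure".
anchor = [1.0, 0.0, 0.0, 1.0]
positive = [1.0, 0.0, 1.0, 0.0]  # matches the anchor on category only
negative = [0.0, 1.0, 0.0, 1.0]  # matches the anchor on closure only

category_mask = [1.0, 1.0, 0.0, 0.0]
closure_mask = [0.0, 0.0, 1.0, 1.0]

# Under the category condition the triplet is already satisfied (zero loss).
loss_category = conditional_triplet_loss(anchor, positive, negative, category_mask)
# Under the closure condition the same triplet is violated (positive loss).
loss_closure = conditional_triplet_loss(anchor, positive, negative, closure_mask)
print(loss_category, loss_closure)
```

The point of the sketch is that one shared embedding can answer contradictory similarity questions, because each condition looks at a different subspace.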
Visualization of 2D embeddings of subspaces learned by the CSN. The spaces are clearly organized according to (a) closure mechanism of the shoes and (b) the category of the shoes. This shows that CSNs can successfully separate the subspaces.
Visualization of the masks. Left: In standard triplet networks, every dimension is weighted equally for every triplet. Center: The Conditional Similarity Network can focus on a subset of the embedding to answer a triplet question. Here, each mask covers one fourth of the dimensions. Right: With learned masks, the model clearly learns to switch off different dimensions per question. Further, a small subset of dimensions is shared across tasks.
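The fixed-mask layout in the center panel can be sketched as follows. This is an illustrative construction, not code from the paper: the helper name `fixed_disjoint_masks` and the 8-dimensional embedding are assumptions, chosen so that four conditions each receive one fourth of the dimensions.

```python
def fixed_disjoint_masks(embedding_dim, num_conditions):
    """Return one 0/1 mask per condition, each covering a disjoint block."""
    if embedding_dim % num_conditions != 0:
        raise ValueError("embedding_dim must be divisible by num_conditions")
    block = embedding_dim // num_conditions
    masks = []
    for c in range(num_conditions):
        mask = [0.0] * embedding_dim
        for k in range(c * block, (c + 1) * block):
            mask[k] = 1.0  # condition c owns dimensions [c*block, (c+1)*block)
        masks.append(mask)
    return masks

# Standard triplet network: a single all-ones mask, every dimension counts.
uniform_mask = [1.0] * 8

# Fixed-mask CSN with four conditions: each mask covers one fourth.
masks = fixed_disjoint_masks(embedding_dim=8, num_conditions=4)
for c, mask in enumerate(masks):
    print(c, mask)
```

Learned masks would replace these hard 0/1 blocks with trainable non-negative weights, which is what lets a few dimensions end up shared across conditions.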
Triplet prediction performance with respect to the number of unique training triplets available. CSNs with fixed masks consistently outperform the set of specialized triplet networks. CSNs with learned masks generally require more triplets, since they need to learn the embedding as well as the masks. However, when enough triplets are available, they provide the best performance.
  • While I doubt a couple of findings reported in this paper, it is very well written. I recommend reading it.
  • The quantitative results highlight the superiority of the CSN with learned masks over the set of N specialized triplet networks. Yet three factors make me doubt the better-accuracy claim: (1) the margin is small, (2) results are reported on a single dataset, and (3) training uses early stopping, where the snapshot that achieves the highest validation performance is the one evaluated on the test set.
  • The standard triplet network baseline is nonsense. The N specialized networks ought to be the baseline. Given the small margin between the N specialized networks and both CSN variants, I would have trained the N specialized networks with masks as well. The extra learnable mask weights alone could explain such a performance gap. Conversely, the fixed disjoint masks could encourage a more focused embedding, confined to 1/n of the dimensions, and thus reduce the chance of overfitting. These two opposing factors, more learnable weights versus a more focused embedding via fixed disjoint masks, are interesting to evaluate. At the least, doing so would provide some insight into why the fixed disjoint masks perform better in general, and especially when training with a smaller number of triplets. To my knowledge, the paper omits such an explanation.
  • The authors describe the sparsity of the learned masks as an “interesting” finding, which surprises me because their loss function includes an L1 regularizer that encourages sparse masks. Since L1 is an aggressive sparsity-inducing regularizer, the learned sparse masks are more expected than interesting.
  • The authors deserve a big round of applause for publishing their implementation online. I plan to use it for some of my experiments.
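On the L1 point above, a one-dimensional toy problem (unrelated to the authors' code) shows why sparsity is the expected outcome. Assume each mask weight m is pulled toward an unregularized target t by the data term 0.5*(m - t)^2. Adding an L1 penalty lam*|m| gives the soft-thresholding solution, which sets small weights exactly to zero, while an L2 penalty 0.5*lam*m^2 only shrinks them. The target values and lam below are hypothetical.

```python
def l1_solution(t, lam):
    """argmin over m of 0.5*(m - t)**2 + lam*abs(m), i.e. soft-thresholding."""
    magnitude = max(abs(t) - lam, 0.0)
    return magnitude if t >= 0 else -magnitude

def l2_solution(t, lam):
    """argmin over m of 0.5*(m - t)**2 + 0.5*lam*m**2, i.e. plain shrinkage."""
    return t / (1.0 + lam)

# Hypothetical unregularized mask weights for one condition.
unregularized_mask = [0.9, 0.05, 0.6, 0.02, 0.3, 0.8]
lam = 0.1

l1_mask = [l1_solution(t, lam) for t in unregularized_mask]
l2_mask = [l2_solution(t, lam) for t in unregularized_mask]

# Count weights driven exactly to zero under each penalty.
l1_zeros = sum(1 for m in l1_mask if m == 0.0)
l2_zeros = sum(1 for m in l2_mask if m == 0.0)
print(l1_zeros, l2_zeros)
```

Every weight whose target falls below lam is switched off under L1, whereas L2 leaves all of them non-zero, so sparse learned masks follow directly from the choice of regularizer.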

Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcomed.