Feature Embedding Regularizers: SVMax & VICReg

Figure 1: During training, a network N generates a feature embedding matrix E ∈ R^{b × d} for a mini-batch of size b.
Figure 2: Feature embeddings scattered over the 2D unit circle. In (a), the features are polarized across a single axis; the singular value of the principal (horizontal) axis is large, while the singular value of the secondary (vertical) axis is small. In (b), the features spread uniformly across both dimensions; both singular values are comparably large.
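The contrast in Figure 2 can be reproduced numerically. The sketch below is only an illustration, not code from either paper: the helper names (`singular_values_2d`, `on_circle`) and the chosen angles are made up for this example. It relies on the fact that the singular values of a b × 2 matrix are the square roots of the eigenvalues of its 2 × 2 Gram matrix, so no linear algebra library is needed.

```python
import math


def singular_values_2d(points):
    """Singular values of a b x 2 matrix, from the eigenvalues of its 2 x 2 Gram matrix E^T E."""
    a = sum(x * x for x, _ in points)   # Gram entry (0, 0)
    c = sum(y * y for _, y in points)   # Gram entry (1, 1)
    g = sum(x * y for x, y in points)   # off-diagonal Gram entry
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + g * g)
    lam_large = mean + radius
    lam_small = max(mean - radius, 0.0)  # clamp tiny negative round-off
    return math.sqrt(lam_large), math.sqrt(lam_small)


def on_circle(angles_deg):
    """Unit-norm 2D embeddings at the given angles (hypothetical toy data)."""
    return [(math.cos(math.radians(t)), math.sin(math.radians(t))) for t in angles_deg]


# (a) polarized: all features sit close to the horizontal axis
polarized = on_circle([-10, -5, 0, 5, 10, 170, 175, 180, 185, 190])
# (b) spread: features evenly spaced around the circle
uniform = on_circle([36 * k for k in range(10)])

s_pol = singular_values_2d(polarized)   # roughly (3.14, 0.39)
s_uni = singular_values_2d(uniform)     # roughly (2.24, 2.24)

mean_pol = sum(s_pol) / 2.0
mean_uni = sum(s_uni) / 2.0
print("polarized:", s_pol, "mean:", mean_pol)
print("uniform:  ", s_uni, "mean:", mean_uni)
```

Both toy batches have the same Frobenius norm (every row has unit length), yet the evenly spread batch has the larger mean singular value, which is the quantity SVMax pushes up.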
Figure 3: Vanilla SVMax formulation. L_r is the original loss function before using the SVMax regularizer, while s_μ is the mean singular value to be maximized.
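Since the figure itself is not reproduced here, the formulation described in the caption can be sketched as follows. The weight λ and the averaging over r = min(b, d) singular values are assumptions made for this sketch; the paper's exact notation and normalization may differ.

```latex
% Hedged reconstruction from the caption; \lambda and r = \min(b, d) are assumed symbols.
\mathcal{L} \;=\; L_r \;-\; \lambda\, s_\mu ,
\qquad
s_\mu \;=\; \frac{1}{r} \sum_{i=1}^{r} s_i ,
\qquad
r = \min(b, d),
```

where s_1 ≥ s_2 ≥ … ≥ s_r ≥ 0 are the singular values of E. The minus sign reflects that s_μ is maximized while the overall loss is minimized.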
Figure 4: The lower bound on the mean singular value is attained when all singular values except the first (largest) equal zero. s^\ast(E) is the value of the largest singular value in that case.
Figure 5: An upper bound on the mean singular value established using the nuclear norm ||E||_* and the Frobenius Norm ||E||_F.
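The two bounds in Figures 4 and 5 can be worked out in a few lines. This is a sketch under one stated assumption: every row of E is L2-normalized (as in the unit-circle picture of Figure 2), so that ||E||_F = √b. The symbol r = min(b, d) for the number of singular values is also an assumption of this sketch.

```latex
% Assumption: rows of E have unit L2 norm, hence \|E\|_F^2 = \sum_i s_i^2 = b.
% Lower bound: all the energy sits in one singular value.
s^\ast(E) = \|E\|_F = \sqrt{b}
\quad\Longrightarrow\quad
s_\mu \;\ge\; \frac{s^\ast(E)}{r} = \frac{\sqrt{b}}{r}.

% Upper bound: Cauchy-Schwarz relates the nuclear and Frobenius norms.
\|E\|_\ast = \sum_{i=1}^{r} s_i \;\le\; \sqrt{r}\,\Big(\sum_{i=1}^{r} s_i^2\Big)^{1/2} = \sqrt{r}\,\|E\|_F
\quad\Longrightarrow\quad
s_\mu = \frac{\|E\|_\ast}{r} \;\le\; \frac{\sqrt{b}}{\sqrt{r}}.
```

For example, with b = 10 unit-norm embeddings in d = 2 dimensions, s_μ lies between √10 / 2 ≈ 1.58 and √10 / √2 ≈ 2.24; the upper end is reached when both singular values are equal.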
Figure 6: The mean singular values of four different feature embedding (metric learning) networks. The X and Y-axes denote the mini-batch size b and the s_μ of the feature embedding of CUB-200’s test split. The feature embedding is learned using a contrastive loss with and without SVMax. The horizontal red line denotes the upper bound on s_μ.
Figure 7: Given a feature embedding matrix E ∈ R^{b × d}, VICReg computes a standard deviation vector S with d dimensions. The standard deviation serves as a metric for evaluating the dimension’s activity. A dimension with zero standard deviation is a collapsed dimension.
Figure 8: The variance term in VICReg computes the standard deviation (std) of each of the d dimensions in the feature embedding matrix E. Then, VICReg encourages each std to reach the target value γ. ϵ is a small scalar preventing numerical instabilities.
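The variance term described in Figures 7 and 8 can be sketched in plain Python. This is an illustrative sketch rather than the official implementation: the function name `vicreg_variance_term` and the toy batches are invented for this example, and the defaults γ = 1 and ϵ = 1e-4 are assumed values. The hinge form used here only penalizes dimensions whose std falls below γ.

```python
import math
import statistics


def vicreg_variance_term(embeddings, gamma=1.0, eps=1e-4):
    """Mean over dimensions of max(0, gamma - sqrt(var + eps)).

    `embeddings` is a list of b rows, each a list of d floats.
    gamma and eps defaults are assumptions for this sketch.
    """
    d = len(embeddings[0])
    total = 0.0
    for j in range(d):
        column = [row[j] for row in embeddings]
        # Unbiased (n - 1) variance; eps keeps the square root well-behaved near zero.
        std = math.sqrt(statistics.variance(column) + eps)
        total += max(0.0, gamma - std)  # hinge: only a std below gamma is penalized
    return total / d


# Dimension 0 is collapsed (constant), dimension 1 is active.
collapsed = [[0.5, 1.0], [0.5, -1.0], [0.5, 1.0], [0.5, -1.0]]
# Both dimensions are active.
active = [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]

loss_collapsed = vicreg_variance_term(collapsed)
loss_active = vicreg_variance_term(active)
print(loss_collapsed, loss_active)
```

The collapsed batch is penalized because one of its dimensions has zero standard deviation, while the active batch incurs no penalty since both of its stds exceed γ.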
Table 1: Quantitative SVMax evaluation using self-supervised learning with an AlexNet backbone. We evaluate the pre-trained network N through ImageNet classification with a linear classifier on top of frozen convolutional layers. For every layer, the convolutional features are spatially resized until there are fewer than 10K dimensions left. A fully connected layer followed by softmax is trained on a 1000-way object classification task.
Table 2: Evaluation of the representations obtained with a ResNet-50 backbone pretrained with VICReg using: (1) linear classification on top of the frozen representations from ImageNet; (2) semi-supervised classification on top of the fine-tuned representations from 1% and 10% of ImageNet samples. We report Top-1 and Top-5 accuracies (in %). Top-3 best self-supervised methods are underlined.
Figure 9: Quantitative evaluation on Stanford CARS196. The X and Y-axes denote the learning rate lr and recall@1 performance, respectively.
  • Both SVMax and VICReg are well-written and well-motivated papers. Both are unsupervised and support various network architectures and tasks. Each delivers far more experiments than this article can cover. I highly recommend these papers to those interested in the feature embedding literature. PyTorch implementations are available for both SVMax and VICReg.
  • Compared to VICReg, the SVMax paper is easier to read as it focuses on a single idea. In contrast, VICReg presents multiple terms, one of which is borrowed from another paper, the Barlow Twins paper [4].
  • Compared to SVMax, VICReg delivers a ton of quantitative evaluation on recent benchmarks. FAIR has the GPUs :)
  • Regarding weight-decay vs. feature embedding regularizers, both SVMax and VICReg regularize the output of a single layer. In contrast, weight-decay is always applied to all network weights (layers). Accordingly, I would like to see a paper evaluate the impact of these feature embedding regularizers when applied to all layers. As mentioned previously, weight-decay had a significant impact in [3], and I wonder whether feature regularizers would have a similar impact.




I write reviews on computer vision papers. Writing tips are welcomed.

Ahmed Taha
