Feature Embedding Regularizers: SVMax & VICReg

What is more important, a deep network's weights or its activations? Obviously, we can derive the network's activations from its weights. Yet, deep networks are non-linear embedding functions, and it is this non-linear embedding we are after. On top of this embedding, we either slap a linear classifier in a classification network or compute similarity in a retrieval network. So, it is surprising that feature embedding regularizers are rarely used in the literature compared to the weight-decay regularizer. Weight decay can impact a network's performance significantly, especially on small datasets [3]. Similarly, feature embedding regularizers can bring a significant impact, e.g., avoiding model collapse. In this article, I will present two related feature embedding regularizers: SVMax [1] and VICReg [2].

Both SVMax and VICReg are unsupervised regularizers, so they support both supervised and un/self-supervised learning. They both operate on individual mini-batches during training; thus, no dataset curation or preprocessing is required. I will use the same notation to describe both. We have a network N that takes an input mini-batch of size b and generates a d-dimensional embedding, i.e., we have an output feature embedding matrix E ∈ R^{b × d} as shown in Fig. 1. The matrix E can be extracted from any network layer, but it is typically extracted from the network's penultimate layer, i.e., after the global average pooling layer.

Figure 1: During training, a network N generates a feature embedding matrix E ∈ R^{b × d} for a mini-batch of size b.
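In code, extracting E might look like the following minimal sketch. The tiny conv stack here is an illustrative stand-in for a real backbone N (e.g., a ResNet with its classifier head removed), not code from either paper:

```python
import torch
import torch.nn as nn

# Toy stand-in for a backbone N: conv features -> global average pooling -> E.
# A real backbone (e.g., a ResNet without its classifier) works the same way.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # global average pooling
    nn.Flatten(),             # -> (b, d) feature embedding matrix E
)

b = 8                          # mini-batch size
x = torch.randn(b, 3, 32, 32)  # a mini-batch of images
E = backbone(x)                # E has shape (b, d) with d = 64 here
print(E.shape)                 # torch.Size([8, 64])
```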

Both SVMax and VICReg regularize the feature embedding output explicitly, which implicitly regularizes the network's weights. For a d-dimensional feature embedding, both SVMax and VICReg aim to activate all dimensions. Put another way, both regularizers aim to get each neuron (dimension) to fire equally often. By doing so, we avoid a model collapse where certain dimensions (neurons) are always active/inactive independent of the input.

SVMax [1] has been proposed for metric learning, where the feature embedding is normalized on the unit circle, i.e., l2-normalized. Accordingly, SVMax aims to scatter the feature embedding uniformly on the unit circle as shown in Fig. 2 (b). The two cases in this figure yield very different singular values for the rectangular matrix E. When the features are polarized across a single or a few dimensions, as shown in Fig. 2 (a), one or a few singular values are large while the rest are small. Conversely, when the features scatter uniformly, all dimensions become active and all singular values increase, i.e., the mean singular value increases.

Figure 2: Feature embeddings scattered over the 2D unit circle. In (a), the features are polarized across a single axis; the singular value of the principal (horizontal) axis is large, while the singular value of the secondary (vertical) axis is small. In (b), the features spread uniformly across both dimensions; both singular values are comparably large.

SVMax capitalizes on this observation and regularizes E to maximize its mean singular value. In its simplest form, SVMax is formulated as follows

Figure 3: Vanilla SVMax formulation. L_r is the original loss function before using the SVMax regularizer, while s_μ is the mean singular value to be maximized.

where s_μ is the mean singular value to be maximized, and L_r is the original loss function (e.g., cross-entropy).
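As a minimal PyTorch sketch of this objective (the function name and toy batches are mine, not from the official implementation):

```python
import torch

def svmax_loss(embeddings, base_loss, lam=1.0):
    """Vanilla SVMax: total loss = L_r - lam * s_mu (maximize the mean singular value).

    lam is the balancing hyperparameter; this sketch l2-normalizes the rows,
    matching SVMax's metric-learning setting.
    """
    E = torch.nn.functional.normalize(embeddings, dim=1)  # rows on the unit circle
    s_mu = torch.linalg.svdvals(E).mean()                 # mean singular value
    return base_loss - lam * s_mu

# Toy comparison: a rank-1 (collapsed) batch has a small s_mu, so it is
# penalized more (higher loss) than a well-spread batch.
b, d = 32, 16
collapsed = torch.randn(1, d).repeat(b, 1)  # every row identical -> rank 1
spread = torch.randn(b, d)                  # generic batch -> near full rank
base = torch.tensor(0.0)                    # stand-in for the original loss L_r
assert svmax_loss(collapsed, base) > svmax_loss(spread, base)
```

Since `torch.linalg.svdvals` is differentiable, the regularizer back-propagates through the singular value decomposition into the network weights.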

Yet, SVMax further exploits the unit-circle (l2-normalization) constraints to establish rigid lower and upper bounds on the mean singular value s_μ. For instance, the lower bound on s_μ holds when the matrix E has rank one, i.e., Rank(E) = 1. This is a clear case of model collapse where a single dimension is always active. In such a case, the lower bound of s_μ equals

Figure 4: A lower bound on the mean singular value holds when all singular values equal zero except the first — largest — singular value. s^\ast(E) is the value of the largest singular value when all other singular values equal zero.

where ||E||_1 and ||E||_∞ are the L1 norm and L-infinity norm, respectively. Similarly, SVMax establishes an upper bound on s_μ as follows

Figure 5: An upper bound on the mean singular value established using the nuclear norm ||E||_* and the Frobenius Norm ||E||_F.

These bounds bring two benefits: (1) It is easy to tune SVMax’s balancing hyperparameter λ (Fig. 3) as the range of s_μ is known before training starts; (2) The mean singular value and its bounds serve as a quantitative metric to evaluate networks after training — including un-regularized networks. For instance, Fig. 6 evaluates four networks trained with different batch sizes. For each network, the mean singular value is computed on the test split, i.e., post-training evaluation. The networks trained with SVMax utilize the feature embedding significantly better compared to the un-regularized networks.
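The bounds are cheap to reason about in code. For l2-normalized rows, the squared singular values sum to ||E||_F² = b, so concentrating that mass in one singular value (rank-1 collapse) gives the smallest possible mean, while splitting it equally across all min(b, d) singular values gives the largest. A small sketch of s_μ as a post-training metric (function name is mine):

```python
import torch

def mean_singular_value(E):
    """Post-training metric: mean singular value of the l2-normalized embedding."""
    E = torch.nn.functional.normalize(E, dim=1)
    return torch.linalg.svdvals(E).mean()

# With unit-norm rows, sum(s_i^2) = ||E||_F^2 = b, which pins s_mu between
# sqrt(b)/min(b,d) (rank-1 collapse) and sqrt(b/min(b,d)) (uniform spread).
b, d = 128, 64
c = min(b, d)
lower, upper = (b ** 0.5) / c, (b / c) ** 0.5

E = torch.randn(b, d)           # stand-in for a test-split embedding
s_mu = mean_singular_value(E).item()
assert lower <= s_mu <= upper   # holds for any E with unit-norm rows
```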

Figure 6: The mean singular values of four different feature embedding (metric learning) networks. The X and Y-axes denote the mini-batch size b and the s_μ of the feature embedding of CUB-200’s test split. The feature embedding is learned using a contrastive loss with and without SVMax. The horizontal red line denotes the upper bound on s_μ.

Despite its simplicity and rigid mathematical bounds, SVMax is computationally expensive. Computing the singular values of E costs roughly O(b · d · min(b, d)), which grows quickly with the matrix dimensions. And this is where VICReg comes to the rescue with a cheaper formulation, but a few caveats.

VICReg [2] has been proposed for self-supervised learning where the feature embedding is not necessarily normalized. VICReg has three terms, but this article will focus on a single term — the Variance term. This term aims to activate each dimension in the feature embedding matrix E. To do so, VICReg computes the standard deviation (std) for E across the mini-batch as shown in Fig. 7. This generates a vector with d-dimensions, each denoting the activity of a single dimension. A dimension with a zero standard deviation is a collapsed dimension — the dimension is always on/off.

Figure 7: Given a feature embedding matrix E ∈ R^{b × d}, VICReg computes a standard deviation vector S with d dimensions. The standard deviation serves as a metric for evaluating the dimension’s activity. A dimension with zero standard deviation is a collapsed dimension.
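Computing this d-dimensional std vector is a one-liner, which also makes it a convenient collapse detector (this helper is my sketch, not VICReg's code):

```python
import torch

def dimension_activity(E):
    """Std of each embedding dimension across the mini-batch: a d-vector S.

    A dimension with (near-)zero std is collapsed -- always on or always off.
    """
    return E.std(dim=0)  # reduce over the batch axis -> shape (d,)

b, d = 64, 8
E = torch.randn(b, d)
E[:, 3] = 0.7                 # dimension 3 outputs a constant -> collapsed
S = dimension_activity(E)
collapsed = (S < 1e-6).nonzero().flatten()
print(collapsed)              # tensor([3])
```

Unlike the SVD in SVMax, this reduction costs only O(b · d).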

The variance term in VICReg is formulated as follows

Figure 8: The variance term in VICReg computes the standard deviation (std) of each of the d dimensions in the feature embedding matrix E. Then, VICReg encourages the std to be γ. ϵ is a small scalar preventing numerical instabilities.

where γ is a hyperparameter that indicates the desired standard deviation per dimension, and ϵ is a small scalar preventing numerical instabilities.

This formulation encourages the standard deviation to equal γ along each dimension, which prevents the collapse in which all inputs are mapped to the same vector. Since the embedding is unnormalized, VICReg cannot make any assumptions about the range or bounds of the standard deviation term. VICReg has two hyperparameters: λ, as in SVMax (Fig. 3), and γ.
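Following the paper's hinge formulation, the variance term can be sketched in a few lines of PyTorch (the function name and toy batches are mine; γ = 1 here):

```python
import torch

def vicreg_variance_term(E, gamma=1.0, eps=1e-4):
    """VICReg variance term: a hinge loss pushing each dimension's std toward gamma.

    eps is a small scalar that prevents numerical instabilities when the
    variance is close to zero.
    """
    std = torch.sqrt(E.var(dim=0) + eps)   # per-dimension std, shape (d,)
    return torch.relu(gamma - std).mean()  # penalize dimensions with std < gamma

# Toy comparison: a fully collapsed batch is penalized heavily, while a
# healthy batch with per-dimension std near 1 incurs almost no penalty.
b, d = 64, 16
collapsed = torch.zeros(b, d)  # every input maps to the same vector
spread = torch.randn(b, d)     # varied embedding, std ~ 1 per dimension
assert vicreg_variance_term(collapsed) > vicreg_variance_term(spread)
```

Note the hinge: dimensions whose std already exceeds γ contribute nothing, so the term only pushes under-active dimensions.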

Qualitatively, both SVMax and VICReg mitigate model collapse without explicit negative sampling. Both regularizers converge to a competitive feature embedding without training tricks such as output quantization, stop-gradient, memory banks, etc. Both regularizers report quantitative evaluations on self-supervised learning benchmarks. In these benchmarks, a linear ImageNet classifier is trained on top of a frozen pre-trained network. Tab. 1 and Tab. 2 present quantitative evaluations for SVMax and VICReg. The two papers come from organizations with different computing capabilities; thus, the SVMax evaluation is primitive, while VICReg's is extensive and up-to-date.

Table 1: Quantitative SVMax evaluation using self-supervised learning with an AlexNet backbone. We evaluate the pre-trained network N through ImageNet classification with a linear classifier on top of frozen convolutional layers. For every layer, the convolutional features are spatially resized until there are fewer than 10K dimensions left. A fully connected layer followed by softmax is trained on a 1000-way object classification task.
Table 2: Evaluation of the representations obtained with a ResNet-50 backbone pretrained with VICReg using: (1) linear classification on top of the frozen representations from ImageNet; (2) semi-supervised classification on top of the fine-tuned representations from 1% and 10% of ImageNet samples. We report Top-1 and Top-5 accuracies (in %). Top-3 best self-supervised methods are underlined.

While VICReg focuses on self-supervised learning and the model collapse problem, SVMax delivers further evaluations using supervised metric learning. While SVMax does not achieve state-of-the-art results in metric learning, it delivers superior performance when hyperparameters are not tuned. For instance, when trained with a large learning rate (lr), metric learning methods learn an inferior embedding or even diverge. Conversely, SVMax makes these supervised methods more resilient, especially to large learning rates, as shown in Fig. 9.

Figure 9: Quantitative evaluation on Stanford CARS196. The X and Y-axes denote the learning rate and recall@1 performance, respectively.

Final Thoughts:

  • Both SVMax and VICReg are well-written and well-motivated papers. Both are unsupervised and support various network architectures and tasks. Each delivers a ton of experiments that are impossible to cover in this article. I highly recommend these papers for those interested in the feature embedding literature. PyTorch implementations are available for both SVMax and VICReg.
  • Compared to VICReg, the SVMax paper is easier to read as it focuses on a single idea. In contrast, VICReg presents multiple terms, and one of these terms is borrowed from another paper, the Barlow Twins paper [4].
  • Compared to SVMax, VICReg delivers a ton of quantitative evaluation on recent benchmarks. FAIR has the GPUs :)
  • Regarding weight-decay vs. feature embedding regularizers, both SVMax and VICReg regularize the output of a single layer. In contrast, weight decay is typically applied to all network weights (layers). Accordingly, I wish a paper would evaluate the impact of these feature embedding regularizers when applied to all layers. As mentioned previously, weight decay had a significant impact in [3], and I wonder if feature embedding regularizers have a similar impact.

References:

[1] Taha, A., Hanson, A., Shrivastava, A. and Davis, L., 2021. SVMax: A Feature Embedding Regularizer.

[2] Bardes, A., Ponce, J. and LeCun, Y., 2021. VICReg: Variance-invariance-covariance regularization for self-supervised learning.

[3] Power, A., Burda, Y., Edwards, H., Babuschkin, I. and Misra, V., 2021. Grokking: Generalization beyond overfitting on small algorithmic datasets.

[4] Zbontar, J., Jing, L., Misra, I., LeCun, Y. and Deny, S., 2021. Barlow twins: Self-supervised learning via redundancy reduction.

Ahmed Taha