Feature Embedding Regularizers: SVMax & VICReg
What is more important, a deep network's weights or its activations? Obviously, we can derive the network's activations from its weights. Yet, deep networks are non-linear embedding functions, and it is this non-linear embedding we actually want. On top of this embedding, we either slap a linear classifier in a classification network or compute similarity in a retrieval network. So it is surprising that feature embedding regularizers are rarely used in the literature compared to the weight-decay regularizer. Weight decay can impact a network's performance significantly, especially on small datasets [3]. Similarly, regularizing the feature embedding can bring a significant impact, e.g., avoiding model collapse. In this article, I will present two related feature embedding regularizers: SVMax [1] and VICReg [2].
Both SVMax and VICReg are unsupervised regularizers, so they support both supervised and un/self-supervised learning. They both operate on individual mini-batches during training; thus, no dataset curation or preprocessing is required. I will use the same notation to describe both. We have a network N that takes an input mini-batch of size b and generates a d-dimensional embedding, i.e., an output feature embedding matrix E ∈ R^{b × d} as shown in Fig. 1. The matrix E can be extracted from any network layer, but it is typically extracted from the network's penultimate layer, i.e., after the global average pooling layer.
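To make the notation concrete, here is a minimal PyTorch sketch that extracts such an embedding matrix from a penultimate layer. The backbone and layer choice are my own illustration, not something prescribed by either paper.

```python
import torch
import torchvision

# Hypothetical backbone; any network with a global average pooling layer works.
backbone = torchvision.models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()  # drop the classifier head, keep the pooled features

x = torch.randn(32, 3, 224, 224)   # a mini-batch of b = 32 images
E = backbone(x)                    # feature embedding matrix E of shape (b, d) = (32, 512)
print(E.shape)                     # torch.Size([32, 512])
```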
Both SVMax and VICReg regularize the feature embedding output explicitly, which implicitly regularizes the network's weights. For a d-dimensional feature embedding, both SVMax and VICReg aim to activate all dimensions. Put another way, both regularizers aim to get each neuron (dimension) to fire with equal likelihood. By doing so, we avoid model collapse, where certain dimensions (neurons) are always active/inactive independently of the input.
SVMax [1] has been proposed for metric learning, where the feature embedding is normalized on the unit circle, i.e., l2-normalized. Accordingly, SVMax aims to scatter the feature embedding uniformly on the unit circle as shown in Fig. 2 (Right). In this figure, there is a significant difference between the singular values of the rectangular matrix E in the two cases. When the features are polarized across a single or a few dimensions as shown in Fig. 2 (Left), a single or a few singular values are large while the rest are small. Conversely, when the features scatter uniformly, all dimensions become active and all singular values increase, i.e., the mean singular value increases.
SVMax capitalizes on this observation and regularizes E to maximize its mean singular value. In its simplest form, SVMax augments the original objective as follows

L = L_r − λ s_μ,

where s_μ is the mean singular value to be maximized, L_r is the original loss function (e.g., cross-entropy), and λ is a balancing hyperparameter.
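As a rough sketch (not the authors' reference implementation), the mean singular value term can be computed with PyTorch's differentiable SVD and subtracted from the task loss:

```python
import torch
import torch.nn.functional as F

def svmax_term(features: torch.Tensor) -> torch.Tensor:
    """Mean singular value s_mu of the l2-normalized embedding matrix E (shape b x d)."""
    E = F.normalize(features, p=2, dim=1)   # project each embedding onto the unit sphere
    return torch.linalg.svdvals(E).mean()   # singular values are differentiable in PyTorch

# Hypothetical usage inside a training step; `lam` stands for the balancing hyperparameter lambda.
lam = 1.0
features = torch.randn(32, 128, requires_grad=True)  # stand-in for the network's embeddings
task_loss = torch.tensor(0.0)                        # stand-in for the original loss L_r
loss = task_loss - lam * svmax_term(features)        # maximizing s_mu = minimizing its negative
loss.backward()
```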
Yet, SVMax further exploits the unit-circle (l2-normalization) constraint to establish rigid lower and upper bounds on the mean singular value s_μ. For instance, the lower bound on s_μ is attained when the matrix E has rank one, i.e., Rank(E) = 1. This is a clear case of model collapse where a single dimension is always active. In such a case, the lower bound of s_μ equals
where ||E||_1 and ||E||_∞ are the L-1 norm and L-infinity norm, respectively. Similarly, SVMax establishes an upper bound on s_μ as follows
These bounds bring two benefits: (1) it is easy to tune SVMax's balancing hyperparameter λ (Fig. 3) because the range of s_μ is known before training starts; (2) the mean singular value and its bounds serve as a quantitative metric to evaluate networks after training, including un-regularized networks. For instance, Fig. 6 evaluates four networks trained with different batch sizes. For each network, the mean singular value is computed on the test split, i.e., a post-training evaluation. The networks trained with SVMax utilize the feature embedding significantly better than the un-regularized networks.
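As a post-training diagnostic, the same quantity can be computed over a test split. The sketch below is my own illustration; it assumes a trained `backbone` and a `test_loader` yielding (image, label) pairs.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_singular_value(backbone, test_loader, device="cpu"):
    """Compute s_mu over the test split to gauge how well the embedding space is utilized."""
    backbone.eval()
    feats = [F.normalize(backbone(x.to(device)), p=2, dim=1) for x, _ in test_loader]
    E = torch.cat(feats, dim=0)                  # stack all test embeddings into one matrix
    return torch.linalg.svdvals(E).mean().item()
```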
Despite its simplicity and rigid mathematical bounds, SVMax is computationally expensive. The cost of computing the mean singular value grows with the matrix dimensions. This is where VICReg comes to the rescue with a cheaper formulation, but with a few caveats.
VICReg [2] has been proposed for self-supervised learning, where the feature embedding is not necessarily normalized. VICReg has three terms, but this article will focus on a single term: the variance term. This term aims to activate each dimension of the feature embedding matrix E. To do so, VICReg computes the standard deviation (std) of E across the mini-batch as shown in Fig. 7. This produces a d-dimensional vector, where each entry denotes the activity of a single dimension. A dimension with zero standard deviation is a collapsed dimension: it is always on/off.
The variance term in VICReg is a hinge loss on the per-dimension standard deviation

v(E) = (1/d) Σ_j max(0, γ − sqrt(Var(E_j) + ϵ)),

where the sum runs over the d embedding dimensions, γ is a hyperparameter that indicates the desired standard deviation per dimension, and ϵ is a small scalar for numerical stability.
This formulation encourages the standard deviation to equal γ along each dimension, which prevents the collapse where all inputs are mapped to the same vector. Since the embedding is unnormalized, VICReg cannot make any assumptions about the range or bounds of the standard deviation term. VICReg has two hyperparameters: λ, as with SVMax (Fig. 3), and γ.
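Here is a minimal sketch of the variance term as described above; γ = 1 and ϵ = 1e-4 are my own assumptions, not values taken from this article.

```python
import torch

def variance_term(features: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    """Hinge on the per-dimension standard deviation, computed across the mini-batch."""
    std = torch.sqrt(features.var(dim=0) + eps)   # one std value per embedding dimension
    return torch.relu(gamma - std).mean()         # penalize dimensions whose std falls below gamma

# A collapsed batch (identical rows) is penalized; a well-spread batch is not.
collapsed = torch.ones(32, 128)
healthy = torch.randn(32, 128)
print(variance_term(collapsed).item())  # close to gamma: every dimension is collapsed
print(variance_term(healthy).item())    # close to 0: every dimension is active
```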
For qualitative evaluation, both SVMax and VICReg mitigate model collapse without explicit negative sampling. Both regularizers converge to a competitive feature embedding without training tricks such as output quantization, stop gradient, or memory banks. Both regularizers report quantitative evaluations on self-supervised learning benchmarks. In these benchmarks, a linear ImageNet classifier is trained on top of a frozen pre-trained network. Tab. 1 and Tab. 2 present quantitative evaluations for SVMax and VICReg. The two papers come from different organizations with different computing capabilities; thus, the SVMax evaluation is primitive, while the VICReg evaluation is extensive and up-to-date.
While VICReg focuses on self-supervised learning and the model collapse problem, SVMax delivers further evaluations using supervised metric learning. While SVMax does not achieve state-of-the-art results in metric learning, it delivers superior performance when hyperparameters are not tuned. For instance, when trained with a large learning rate (lr), metric learning methods learn an inferior embedding or diverge entirely. In contrast, SVMax makes these supervised methods more resilient, especially to large learning rates, as shown in Fig. 9.
Final Thoughts:
- Both SVMax and VICReg are well-written and well-motivated papers. Both are unsupervised and support various network architectures and tasks. Each delivers a ton of experiments that are impossible to cover in this article. I highly recommend these papers to those interested in the feature embedding literature. PyTorch implementations are available for both SVMax and VICReg.
- Compared to VICReg, the SVMax paper is easier to read as it focuses on a single idea. In contrast, VICReg presents multiple terms, and one of these terms is borrowed from another paper, the Barlow Twins paper [4].
- Compared to SVMax, VICReg delivers a ton of quantitative evaluation on recent benchmarks. FAIR has the GPUs :)
- Regarding weight decay vs. feature embedding regularizers, both SVMax and VICReg regularize the output of a single layer. In contrast, weight decay is always applied to all network weights (layers). Accordingly, I wish a paper would evaluate the impact of these feature embedding regularizers when applied to all layers. As mentioned previously, weight decay had a significant impact in [3], and I wonder if feature embedding regularizers have a similar impact.
References:
[1] Taha, A., Hanson, A., Shrivastava, A. and Davis, L., 2021. SVMax: A Feature Embedding Regularizer.
[2] Bardes, A., Ponce, J. and LeCun, Y., 2021. VICReg: Variance-invariance-covariance regularization for self-supervised learning.
[3] Power, A., Burda, Y., Edwards, H., Babuschkin, I. and Misra, V., 2021. Grokking: Generalization beyond overfitting on small algorithmic datasets.
[4] Zbontar, J., Jing, L., Misra, I., LeCun, Y. and Deny, S., 2021. Barlow twins: Self-supervised learning via redundancy reduction.