Sharpness-Aware Minimization for Efficiently Improving Generalization

Figure 1: Two different minima with the same loss value but different sharpness (curvature). Left Minimum: A local minimum with high curvature, and expected high generalization error. Right Minimum: A local minimum with low curvature, and expected low generalization error. L_s(w) is the loss value on the training dataset. L_s(w+ϵ) is a surrogate for the loss curvature.
Figure 2: SAM two-step optimization from two different perspectives. (Left) SAM visualization in 2D where the red-curve is the loss curve, and arrows denote the gradient (Right) SAM visualization in 3D surface. Starting at the yellow circle, SGD would reach the red circle after one optimization step, while SAM would reach the green circle.
Table 1: SAM evaluation using state-of-the-art models on CIFAR-{10, 100} (WRN = WideResNet;
AA = AutoAugment; SGD is the standard non-SAM procedure used to train these models).
Table 2: SAM evaluation using ResNets on ImageNet.
Table 3: Top-1 error rates for finetuning EfficientNet on various downstream tasks
Table 4: Test accuracy on a clean test split for models trained on CIFAR10 with noisy labels.
  • [W] I wish the paper emphasized that SAM achieves significant improvement when coupled with over-parameterized networks, and not every network. I found this empirically by evaluating SAM on two networks ResNet18 and WideResNet50. Improvement margins are marginal on ResNet18, but significant on WideResNet50 as shown in the following table. A similar finding has been reported in [3] — “the degree of improvement negatively correlates with the level of inductive biases built into the architecture.
Table 5: SAM evaluation on two architectures: Compact-ResNet18 and Overparameterized-WideResNet50. Minimal augmentation is used in this evaluation. SAM brings marginal improvement to ResNet18 but boosts performance significantly on WideResNet50.
  • [S] The thing I like most about this paper is that it reminds me that we use gradient descent because it is computationally feasible, not because it is optimal. In 2022, we can only compute the 1st derivative with respect to the network’s weights, while the 2nd derivative (curvature) is computationally prohibitive. Once this computational limitation is unlocked, more can be explored [2].
  • For those interested in the topic, A follow-up paper [4] proposes adaptive SAM (ASAM). Developed by Samsung Research, ASAM makes tuning the ρ hyperparameter easier. I found ASAM useful when having a multi-stage training procedure (pre-training -> fine-tuning). A fixed ρ achieved inferior performance with SAM, but superior performance with ASAM.

--

--

--

I write reviews on computer vision papers. Writing tips are welcomed.

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

PyTorch: A Brief Intro

SSD Object Detection in Real Time (Deep Learning and Caffe)

Fixed Feature Extractor as the Transfer Learning Method for Image Classification Using MobileNet

Predict Stock Market Movements with Machine Learning

The Essential Guide to Content Moderation

Real estate market analysis for King County with Python

Simplify Machine Learning for Beginners

Perceptron: A gateway to the World of Neural Network

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Ahmed Taha

Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcomed.

More from Medium

Paper Review: Patch-based inference for TinyML

How to Implement an Efficient LayerNorm CUDA Kernel — OneFlow Performance Optimization

Review — A ConvNet for the 2020s

Speaking Code: LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS