Pre-trained representations bring two benefits during fine-tuning: (1) improved sample efficiency and (2) simpler hyperparameter tuning. Towards these benefits, this paper provides a recipe for both pre-training and fine-tuning neural networks for vision tasks. The two steps are entangled, and a good engineering recipe is essential to get the best performance.
While pre-training with unlabeled data is common these days (2023), this 2020 paper performs pre-training on labeled data, i.e., a fully supervised setting. Despite that, the paper provides good engineering tips that should work for both supervised and self-supervised pre-training. The paper refers to this setup as Big Transfer (BiT). This article presents the pre-training and fine-tuning setups in the paper, then summarizes BiT’s engineering tips.
The paper learns pre-trained representations in a fully supervised manner. Accordingly, it leverages three labeled datasets: (1) ImageNet-1K with 1.3M images (BiT-Small), (2) ImageNet-21k with 14M images (BiT-Medium), and (3) JFT with 300M images (BiT-Large). The paper uses ResNet models in all experiments.
After pre-training, the pre-trained representations (BiT-S/M/L) are evaluated on well-established benchmarks: ImageNet-1K, CIFAR-10/100, Oxford-IIIT Pet and Oxford Flowers-102. Fig. 1 presents a quantitative evaluation for BiT on ImageNet-1K, Oxford-IIIT Pet, and CIFAR-100.
The Engineering Tips
The paper’s main contribution is a set of tips for achieving the best performance with minimal hyperparameter tuning.
During pre-training, it is important to scale both model and dataset size. Not only is there limited benefit from training a large model on a small dataset, but there is also limited (or even negative) benefit from training a small model on a larger dataset. Fig. 2 shows a small model (ResNet-50) achieving inferior performance with JFT-300M compared to the same model with ImageNet-21k (14M images). One should not mistakenly conclude that larger datasets bring no additional benefit; instead, one should scale both model and dataset size to benefit from a large dataset.
During pre-training, a sufficient computational budget is crucial to learn high-performing models on large datasets. The standard ILSVRC-2012 training schedule processes roughly 115 million images (1.28M images × 90 epochs). However, the same training schedule learns an inferior model when applied to ImageNet-21k. Fig. 3 shows that increasing the computational budget not only recovers ILSVRC-2012-level performance, but significantly exceeds it. The paper argues that this large computational budget may have prevented wide adoption of ImageNet-21k for pre-training.
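To make the budget gap concrete, here is a back-of-the-envelope computation of how many passes the same image budget gives over ImageNet-21k, using the dataset sizes quoted above:

```python
# Image budget of the standard ILSVRC-2012 schedule.
ilsvrc_images = 1.28e6
ilsvrc_epochs = 90
budget = ilsvrc_images * ilsvrc_epochs  # ~115M images seen during training

# The same budget spread over ImageNet-21k covers far fewer passes.
imagenet21k_images = 14e6
print(f"Equivalent ImageNet-21k epochs: {budget / imagenet21k_images:.1f}")  # ~8.2 epochs
```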
To further emphasize the importance of a sufficient computational budget, Fig. 4 shows that the validation error on JFT-300M may appear to plateau over a long window (e.g., 8 GPU-weeks), even though the model is still improving, as revealed by a longer time window.
During pre-training, a large weight decay is important. A small weight decay can create an apparent acceleration of convergence, as shown in Fig. 5 (orange curve). However, it eventually results in an inferior final model.
The reason is that a small weight decay lets weight norms grow, which in turn reduces the effective impact of a given learning rate lr: a fixed lr can only make proportionally small updates to weights whose norm has grown large. So a small weight decay creates an impression of faster convergence, but it eventually stalls further progress. A sufficiently large weight decay is required to avoid this effect.
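A minimal PyTorch sketch of the knob in question is below; the optimizer settings (SGD with momentum and weight decay 1e-4) are illustrative assumptions, not values prescribed by this article, and the helper simply monitors the global weight norm whose steady growth would signal that the decay is too small.

```python
import torch

model = torch.nn.Linear(512, 10)  # stand-in for the actual network

# Illustrative optimizer settings; the concrete lr/weight_decay values are assumptions.
# The point is that weight_decay counteracts unchecked growth of the weight norms.
optimizer = torch.optim.SGD(model.parameters(), lr=0.03, momentum=0.9, weight_decay=1e-4)

def global_weight_norm(m: torch.nn.Module) -> float:
    """A steadily growing value here hints that the weight decay is too small."""
    return torch.sqrt(sum(p.norm() ** 2 for p in m.parameters())).item()
```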
During both pre-training and fine-tuning, replace Batch Normalization (BN) with Group Normalization (GN) and Weight Standardization (WS). Batch Normalization degrades with small per-device batch sizes, which are expected with large models (e.g., ResNet-152). One can tackle this by accumulating BN statistics across all accelerators, but this introduces two new problems: (1) computing BN statistics across large batches has been shown to harm generalization, and (2) global BN requires many aggregations across accelerators, which increases latency. Fig. 6 shows that GN+WS outperforms BN significantly while supporting a large overall batch size.
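A minimal PyTorch sketch of the GN+WS combination is shown below; the channel count and the choice of 32 groups are illustrative assumptions, not values quoted from the article.

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: each filter is normalized to zero mean and
    unit variance before the convolution, independently of the batch size."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# A ResNet-style conv block using GroupNorm instead of BatchNorm;
# 64 channels and 32 groups are illustrative choices.
block = nn.Sequential(
    WSConv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.GroupNorm(num_groups=32, num_channels=64),
    nn.ReLU(inplace=True),
)
```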
During fine-tuning, omit regularization techniques such as weight decay and Dropout. The paper fixes most hyperparameters (e.g., learning rate, optimizer, momentum) across downstream tasks; only three are chosen per task: (1) training schedule length, (2) resolution, and (3) whether to use MixUp.
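A sketch of such a per-task rule is given below; the concrete schedule lengths and resolutions are illustrative placeholders rather than values quoted from this article, and the MixUp decision follows the dataset-size thresholds given in the next tip.

```python
def finetune_hyperparams(train_size: int, use_high_resolution: bool = True) -> dict:
    """Hedged sketch of a per-task heuristic in the spirit of the paper.

    Only three knobs are chosen per task: schedule length, resolution, and MixUp.
    The step counts and resolutions below are illustrative placeholders.
    """
    if train_size < 20_000:          # small dataset
        schedule_steps = 500
    elif train_size <= 500_000:      # medium dataset
        schedule_steps = 10_000
    else:                            # large dataset
        schedule_steps = 20_000

    return {
        "schedule_steps": schedule_steps,
        "resolution": 384 if use_high_resolution else 128,  # illustrative sizes
        "use_mixup": train_size >= 20_000,                   # see the next tip
    }
```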
During pre-training, MixUp augmentation is not useful due to data abundance. MixUp is useful during fine-tuning on medium and large datasets (20k–500k images), but not on small datasets (< 20k images).
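For reference, a minimal MixUp training step in PyTorch is sketched below; the Beta parameter alpha=0.1 is an illustrative choice, not a value given in this article.

```python
import numpy as np
import torch
import torch.nn.functional as F

def mixup_step(model, x, y, alpha: float = 0.1):
    """One training step with MixUp: blend each image with a shuffled partner and
    mix the two cross-entropy losses with the same coefficient."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    logits = model(x_mixed)
    return lam * F.cross_entropy(logits, y) + (1.0 - lam) * F.cross_entropy(logits, y[perm])
```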
Through these tips, BiT achieves SOTA performance and outperforms both generalist and specialist representations, as shown in Fig. 7. Generalist approaches are pre-trained independently of the downstream tasks, while specialist approaches rely on task-dependent auxiliary training. Specialist representations achieve strong performance, but incur a large training cost per task. In contrast, generalist representations require large-scale training only once, followed by a low-cost fine-tuning phase.
- The paper is well-organized and delivers valuable tips for those interested in pre-training. While it assumes a fully supervised pre-training setup, it is valuable for those doing self-supervised learning as well.
- The paper delivers a deep analysis across various tasks (e.g., object classification and detection). Yet, all experiments use natural images and ResNet architectures.
- Some of the proposed tricks are architecture-specific. For instance, replacing BatchNorm with GroupNorm + Weight Standardization applies to the ResNet architecture, but not to recent architectures (e.g., ViTs) that already use LayerNorm.