Big Transfer (BiT): General Visual Representation Learning

Ahmed Taha
5 min read · Jul 3, 2023

Pre-trained representations bring two benefits during fine-tuning: (1) improved sample efficiency, and (2) simplified hyperparameter tuning. Towards this goal, this paper [1] provides a recipe for both pre-training and fine-tuning neural networks for vision tasks. These two steps are entangled and a good engineering recipe is essential to get the best performance.

While pre-training with unlabeled data is common these days (2023), this 2020 paper performs pre-training on labeled data, i.e., in a fully supervised setting. Despite that, the paper provides good engineering tips that should work for both supervised and self-supervised pre-training. The paper refers to this setup as Big Transfer (BiT). This article presents the pre-training and fine-tuning setups in the paper, then summarizes BiT’s engineering tips.

Pre-Training Setup

The paper learns pre-trained representations in a fully supervised manner. Accordingly, the paper leverages three labeled datasets: (1) ImageNet-1K with 1.3M images (BiT-Small), (2) ImageNet-21k with 14M images (BiT-Medium), and (3) JFT with 300M images (BiT-Large). The paper uses ResNet models in all experiments.

Fine-tuning Setup

After pre-training, the pre-trained representations (BiT-S/M/L) are evaluated on well-established benchmarks: ImageNet-1K, CIFAR-10/100, Oxford-IIIT Pet and Oxford Flowers-102. Fig. 1 presents a quantitative evaluation for BiT on ImageNet-1K, Oxford-IIIT Pet, and CIFAR-100.

Figure 1: Effect of upstream data (shown on the x-axis) and model size on downstream performance. Note that exclusively using more data or larger models may hurt performance; instead, both need to be increased in tandem.

The Engineering Tips

The paper’s main contribution is a set of tips for achieving the best performance with minimal hyperparameter tuning.

Tip #1:

During pre-training, it is important to scale both model and dataset size. Not only is there limited benefit from training a large model on a small dataset, but there is also limited (or even negative) benefit from training a small model on a larger dataset. Fig. 2 shows a small model (ResNet-50) achieving inferior performance when pre-trained on JFT-300M compared to the same model pre-trained on ImageNet-21k (14M images). One should not mistakenly conclude that larger datasets bring no additional benefit. Instead, one should scale both model and dataset size to benefit from a large dataset.

Figure 2: A ResNet-50x1 (small blue circle) trained on JFT-300M underperforms the same architecture trained on the smaller ImageNet-21k. Thus, if one uses only a ResNet-50x1, one may mistakenly conclude that scaling up the dataset does not bring any additional benefit. However, larger architectures (e.g., ResNet-152x4) pre-trained on JFT-300M significantly outperform those pre-trained on ILSVRC-2012.

Tip #2:

During pre-training, a sufficient computational budget is crucial to learn high-performing models on large datasets. The standard ILSVRC-2012 training schedule processes roughly 115 million images (1.28M images × 90 epochs). However, the same training schedule learns an inferior model when applied to ImageNet-21k. Fig. 3 shows that increasing the computational budget not only recovers ILSVRC-2012 performance, but significantly outperforms it. The paper argues that this large computational budget may have prevented wide adoption of ImageNet-21k for pre-training.
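To make the budget arithmetic concrete, here is a minimal sketch that measures a training schedule in images processed. It assumes that "the same schedule" means the same number of images processed; the function names are hypothetical, and the dataset sizes are the rounded figures quoted in this article.

```python
# Sketch: training budget measured in images processed (dataset size x epochs).
ILSVRC_IMAGES = 1_280_000        # ILSVRC-2012 training images, from the text
STANDARD_EPOCHS = 90             # standard schedule, from the text
IMAGENET21K_IMAGES = 14_000_000  # rounded ImageNet-21k size, from the text


def images_processed(dataset_size: int, epochs: float) -> float:
    """Total number of images seen during training."""
    return dataset_size * epochs


def epochs_for_budget(budget_images: float, dataset_size: int) -> float:
    """Number of passes over a dataset that a fixed image budget pays for."""
    return budget_images / dataset_size


standard_budget = images_processed(ILSVRC_IMAGES, STANDARD_EPOCHS)
print(f"standard budget: {standard_budget / 1e6:.1f}M images")

# Assumption: "same schedule" is read as "same number of images processed".
for multiplier in (1, 3, 10):
    epochs = epochs_for_budget(multiplier * standard_budget, IMAGENET21K_IMAGES)
    print(f"{multiplier}x budget -> {epochs:.1f} epochs over ImageNet-21k")
```

Under this reading, the standard budget pays for only about eight passes over ImageNet-21k, which gives some intuition for why the longer schedules in Fig. 3 help.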

Figure 3: Applying the standard training schedule (90 epochs) of ILSVRC-2012 to the larger ImageNet-21k seems detrimental. Yet, training longer (3x and 10x) brings out the benefits of training on the larger dataset.

To further emphasize the importance of a sufficient computational budget, Fig. 4 shows that the validation error on JFT-300M may appear flat over a long period (8 GPU-weeks), even though the model is still improving, as becomes evident over a longer time window (8 GPU-months).

Figure 4: The learning progress of a ResNet-101x3 on JFT-300M seems to be flat even after 8 GPU-weeks, but after 8 GPU-months progress is clear.

Tip #3:

During pre-training, a large weight decay is important. A small weight decay can result in an apparent acceleration of convergence, as shown in Fig. 5 (orange curve). However, a small weight decay eventually results in an inferior final model.

Figure 5: Faster initial convergence with lower weight decay may trick the practitioner into selecting a sub-optimal value. Higher weight decay converges more slowly, but results in a better final model.

A small weight decay results in growing weight norms, which in turn reduce the effective learning rate: a given learning rate lr changes large weights proportionally less than small ones. In other words, a fixed lr can barely move weights that have grown large under a small weight decay. So, a small weight decay creates an impression of faster convergence, but it eventually prevents further progress. A sufficiently large weight decay is required to avoid this effect.
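The following toy sketch illustrates the weight-norm argument; it is not the paper's experiment. It assumes a loss that depends only on the direction of the weights (as happens when a normalization layer follows them) and measures how far one SGD step rotates that direction. The loss, the function names, and the numbers are all illustrative assumptions.

```python
import numpy as np


def scale_invariant_grad(w: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Gradient of loss(w) = -(w / ||w||) . target with respect to w.

    The loss depends only on the direction of w, so the gradient
    shrinks in proportion to 1 / ||w||.
    """
    norm = np.linalg.norm(w)
    u = w / norm
    return -(target - (u @ target) * u) / norm


def rotation_after_step(w: np.ndarray, target: np.ndarray, lr: float) -> float:
    """Angle (radians) by which one SGD step rotates the weight direction."""
    w_new = w - lr * scale_invariant_grad(w, target)
    cos = (w @ w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


direction = np.array([1.0, 0.0])
target = np.array([0.0, 1.0])
lr = 0.01

# Same direction and same learning rate, but two different weight norms.
for norm in (1.0, 10.0):
    angle = rotation_after_step(norm * direction, target, lr)
    print(f"||w|| = {norm:>4}: one step rotates w by {angle:.6f} rad")
```

In this toy setting, a tenfold larger weight norm makes the same learning rate roughly a hundred times less effective at changing the weight direction, which is one way to read the stalled progress in Fig. 5.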

Tip #4:

During both pre-training and fine-tuning, replace Batch Normalization (BN) with Group Normalization (GN) and Weight Standardization (WS). Batch Normalization degrades with small per-device batch sizes, which are to be expected with large models (e.g., ResNet-152). To tackle this problem, one can accumulate BN statistics across all of the accelerators. However, this introduces two new problems: (1) computing BN statistics across large batches has been shown to harm generalization; (2) using global BN requires many aggregations across accelerators, which increases latency. Fig. 6 shows that GN+WS outperforms BN significantly while supporting a large overall batch size.
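As a rough illustration of why GN+WS sidesteps the batch-size problem, here is a minimal NumPy sketch of both operations. It omits the learnable scale and shift, and the placement of eps is one common choice rather than necessarily the one used in the paper's code; the function names are hypothetical.

```python
import numpy as np


def standardize_weights(weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Weight Standardization: zero mean, unit variance per output filter.

    weight has shape (out_channels, in_channels, kernel_h, kernel_w).
    """
    mean = weight.mean(axis=(1, 2, 3), keepdims=True)
    var = weight.var(axis=(1, 2, 3), keepdims=True)
    return (weight - mean) / np.sqrt(var + eps)


def group_norm(x: np.ndarray, num_groups: int, eps: float = 1e-5) -> np.ndarray:
    """Group Normalization without the learnable scale and shift.

    x has shape (batch, channels, height, width). Statistics are computed
    per sample and per channel group, never across the batch dimension.
    """
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)


rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 32, 4, 4))

# One image gets the same output whether it is normalized alone or in a batch of 8.
alone = group_norm(batch[:1], num_groups=8)
in_batch = group_norm(batch, num_groups=8)[:1]
print(np.allclose(alone, in_batch))

kernel = standardize_weights(rng.normal(size=(16, 32, 3, 3)))
print(np.allclose(kernel.mean(axis=(1, 2, 3)), 0.0, atol=1e-6))
```

Because neither operation looks across the batch dimension, the result does not depend on how many images each accelerator holds, and no cross-device synchronization of statistics is needed.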

Figure 6: GroupNorm+Weight Standardization outperforms BatchNorm by large margins, especially with small per-GPU mini-batches.

Tip #5:

During fine-tuning, omit regularization techniques such as weight decay and Dropout. The paper fixes most hyperparameters (e.g., learning rate, optimizer, momentum) across the downstream tasks. Only three hyperparameters are set per task: (1) training schedule length, (2) resolution, and (3) whether to use Mixup or not.
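The three per-task choices can be written as a small lookup. The sketch below follows my recollection of the paper's heuristic (which the paper calls BiT-HyperRule); the step counts, the 20k and 500k example thresholds, and the 96-pixel resolution threshold should be checked against the paper before use. `FineTuneConfig` and `choose_finetune_config` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    steps: int       # training schedule length
    resize: int      # side length images are resized to
    crop: int        # side length of the random crop fed to the model
    use_mixup: bool  # whether Mixup is applied


def choose_finetune_config(num_examples: int, image_side: int) -> FineTuneConfig:
    """Pick schedule, resolution, and Mixup from dataset size and image size.

    All thresholds and values are assumptions recalled from the paper.
    """
    if num_examples < 20_000:        # small task
        steps, use_mixup = 500, False
    elif num_examples < 500_000:     # medium task
        steps, use_mixup = 10_000, True
    else:                            # large task
        steps, use_mixup = 20_000, True

    # Low-resolution images get a smaller training resolution.
    resize, crop = (160, 128) if image_side < 96 else (448, 384)
    return FineTuneConfig(steps, resize, crop, use_mixup)


# CIFAR-10: 50,000 training images of 32x32 pixels.
print(choose_finetune_config(num_examples=50_000, image_side=32))
```

The point of such a rule is that nothing is searched per task: every choice follows from two properties of the dataset that are known before training starts.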

Tip #6:

During pre-training, Mixup [2] augmentation is not useful due to data abundance. Mixup is useful during fine-tuning on medium or large-sized datasets (20k–500k examples), but not on small datasets (< 20k examples).
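For readers unfamiliar with Mixup [2], here is a minimal sketch of the augmentation: each training example is replaced by a convex combination of itself and a randomly chosen partner, with the mixing weight drawn from a Beta(alpha, alpha) distribution. Drawing one weight per batch and the value alpha=0.1 are illustrative assumptions, and `mixup_batch` is a hypothetical name.

```python
import numpy as np


def mixup_batch(images, one_hot_labels, alpha, rng):
    """Mix each example with a randomly chosen partner from the same batch.

    A single mixing weight lam ~ Beta(alpha, alpha) is drawn for the batch.
    """
    lam = rng.beta(alpha, alpha)
    partner = rng.permutation(len(images))
    mixed_images = lam * images + (1.0 - lam) * images[partner]
    mixed_labels = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[partner]
    return mixed_images, mixed_labels


rng = np.random.default_rng(0)
images = rng.normal(size=(4, 3, 8, 8))  # toy batch of 4 images
labels = np.eye(10)[[0, 3, 3, 7]]       # one-hot labels for 10 classes

mixed_images, mixed_labels = mixup_batch(images, labels, alpha=0.1, rng=rng)
print(mixed_images.shape, mixed_labels.sum(axis=1))
```

With a small alpha, most draws of lam land near 0 or 1, so the mixed examples stay close to real ones and the regularization is mild.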

Through these tips, BiT achieves SOTA and outperforms both generalized and specialist representations, as shown in Fig. 7. Generalized approaches are pre-trained independently of downstream tasks, while specialist approaches rely on task-dependent auxiliary training. Specialist representations usually achieve better performance, but incur a large training cost per task. In contrast, generalized representations require large-scale training only once, followed by a low-cost fine-tuning phase.

Figure 7: BiT achieves SOTA and outperforms both generalized and specialist representations.

My Comments

  • The paper [1] is well-organized and delivers valuable tips for those interested in pre-training. While the paper assumes a fully supervised pre-training setup, it is valuable for those doing self-supervised learning as well.
  • The paper delivers a deep analysis on various tasks (e.g., object classification and detection). Yet, all experiments use natural images and ResNet architectures.
  • Some of the proposed tricks are architecture-specific. For instance, replacing BatchNorm with GroupNorm+Weight Standardization works for the ResNet architecture, but does not carry over to recent architectures (e.g., ViTs) that use LayerNorm.


  1. Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S. and Houlsby, N., 2020. Big Transfer (BiT): General visual representation learning. ECCV 2020.
  2. Zhang, H., Cisse, M., Dauphin, Y.N. and Lopez-Paz, D., 2018. mixup: Beyond empirical risk minimization. ICLR 2018.