Masked Autoencoders Are Scalable Vision Learners

Ahmed Taha
7 min read · Mar 27, 2023


Annotated data is a vital pillar of deep learning. Yet, annotated data is rare in certain applications (e.g., medical and robotics). To reduce the number of annotations, self-supervised learning aims to pre-train deep networks on unannotated data to learn useful representations. Different self-supervised learning approaches propose different objectives to train a deep network with unannotated data. This paper [1] leverages the masked autoencoding objective to pre-train ViT models on images.

While the masked autoencoding objective was proposed a long time ago, it became prominent thanks to BERT. BERT is a language model pre-trained on abundant unlabeled text from the web. During pre-training, BERT takes a sentence and masks some of its words; the objective is to predict the masked words, as shown in Fig. 1.
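As a toy illustration of the masked-word objective (this is not BERT's actual tokenizer or masking recipe, just a sketch of the idea), one can hide a random subset of tokens and ask a model to fill them in:

```python
import random

random.seed(0)

# Toy masked-word objective: hide roughly 15% of the tokens.
# Whitespace tokenization is a simplification; BERT uses WordPiece sub-word tokens.
sentence = "the quick brown fox jumps over the lazy dog".split()
masked = [w if random.random() > 0.15 else "[MASK]" for w in sentence]

print(" ".join(masked))
# A model pre-trained this way must predict the original word behind every [MASK] token.
```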

Figure 1: The BERT model pre-trained with masked language modeling. Besides masked language modeling, BERT is also pre-trained with other objectives (e.g., Next Sentence Prediction). These secondary objectives are omitted in this article to focus on masking inputs, i.e., words in a sentence or patches in an image.

BERT has been a success in natural language processing (NLP) as it eliminates the cost required to collect and annotate labeled datasets. Despite its success in NLP, replicating BERT for vision applications has been a challenge for the following three reasons:

  1. The architecture gap between vision and NLP: while Transformers dominate NLP applications, CNNs used to dominate vision applications. Transformer blocks make it straightforward to mask an input unit (e.g., a set of random words). In contrast, CNNs slide kernels over overlapping regions of an image, which makes it unnatural to mask out an input unit (e.g., a set of random pixels/patches).
  2. The information-density difference between pixels and words: a single word — without context — carries valuable information, while a single image pixel — without context — carries almost none. Predicting a masked word requires sophisticated language understanding. In contrast, a masked pixel can be predicted trivially from neighboring patches with little high-level understanding of parts, objects, and scenes.
  3. Predicting words is technically simple (e.g., using an MLP), but predicting pixels is computationally expensive — there are many pixels per image. Beyond the computational cost, predicting raw pixel values makes little sense! For instance, if the predicted image is shifted left or right by one pixel, the model suffers a high loss despite getting the image semantics right. Likewise, if the model predicts the semantics correctly but with the wrong pixel values (a green apple instead of a yellow apple), it is penalized unfairly.

To tackle these three challenges, the paper

  1. Uses a ViT model — a Transformer-based model — which has been gaining momentum in vision.
  2. Masks a high portion of random patches. This reduces redundancy and creates a challenging self-supervisory task that requires a holistic understanding beyond low-level image statistics.
  3. Leverages a lightweight decoder to reduce the computational complexity of predicting many pixels.

Fig. 2 presents the MAE architecture and highlights the paper’s three key ideas: (1) use a ViT encoder, (2) mask many patches, and (3) leverage a lightweight decoder.

Figure 2: The MAE architecture. During pre-training, a large random subset of image patches (e.g., 75%) is masked out. The encoder processes a small subset of visible patches. The full set of encoded patches and mask tokens is processed by a lightweight decoder that reconstructs the original image in pixels. After pre-training, the decoder is discarded and the encoder processes uncorrupted images (full sets of patches) for recognition tasks.
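To make the flow in Fig. 2 concrete, below is a minimal sketch of the per-sample random masking step in PyTorch. It loosely follows the shuffle/unshuffle logic of the released implementation, but the function name, shapes, and defaults are illustrative rather than copied from the authors' code. Only the visible patch embeddings are fed to the ViT encoder; the binary mask and the restore indices are kept so that the decoder can later insert mask tokens at the original positions.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """patches: (N, L, D) patch embeddings. Returns the visible subset,
    a binary mask (1 = masked, 0 = visible), and indices to restore order."""
    N, L, D = patches.shape
    len_keep = int(L * (1 - mask_ratio))

    noise = torch.rand(N, L, device=patches.device)   # one random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)         # random permutation of patches
    ids_restore = torch.argsort(ids_shuffle, dim=1)   # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]              # indices of visible patches
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(N, L, device=patches.device)    # start with everything masked
    mask[:, :len_keep] = 0                            # first len_keep (in shuffled order) are visible
    mask = torch.gather(mask, 1, ids_restore)         # back to the original patch order
    return visible, mask, ids_restore

# The encoder runs on `visible` only. The decoder later takes the encoded visible tokens
# plus learned mask tokens, re-ordered with `ids_restore`, and predicts the missing pixels.
```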

MAE uses a mean squared error (MSE) loss to reconstruct the masked pixels, and the loss is computed only on the masked patches. Because MAE regresses the RGB values of each masked patch, it can render RGB reconstructions that serve as a sanity check and provide qualitative results, as shown in Fig. 3.
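A minimal sketch of that loss, assuming flattened patches as prediction targets (the tensor shapes and names below are mine, not the paper's code):

```python
def mae_reconstruction_loss(pred, target, mask):
    """pred, target: (N, L, patch_dim) flattened patches; mask: (N, L), 1 = masked."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # MSE within each patch
    return (per_patch * mask).sum() / mask.sum()      # average over masked patches only
```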

Figure 3: Qualitative evaluation using COCO validation images and an ImageNet Pre-trained MAE. For each triplet, the left image shows the masked image, the middle image shows MAE’s reconstruction, and the right image shows the ground truth. Notice the reconstructions on the two right-most examples, which, although different from the ground truth, are semantically plausible. These two examples would incur a high RGB reconstruction loss unfairly.

Despite its aesthetically pleasing outputs, using raw RGB pixels as the target is not ideal. If the model predicts the image semantics correctly but with the wrong RGB values (the two right-most examples in Fig. 3), it suffers a high loss unfairly. Accordingly, the paper explores other reconstruction targets. For instance, it evaluates normalized pixel values — computed per masked patch — as a reconstruction target. Specifically, each 16x16 patch is standardized using the mean and standard deviation of all pixels in that patch. Tab. 1 shows that using normalized pixels improves representation quality.

Table 1: Quantitative evaluation of different reconstruction targets. Pixel (w/o norm) denotes using raw RGB values as targets. Pixel (w/ norm) denotes per-patch normalized pixels (zero mean, unit std). PCA denotes using the largest PCA coefficients (96 of them, rather than the full patch dimension) as targets. dVAE denotes the discrete tokenizer for RGB images proposed in Microsoft's BEiT paper.
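A sketch of the “pixel (w/ norm)” target from Tab. 1, assuming the statistics are computed over each flattened patch (the epsilon constant is my addition for numerical stability):

```python
def per_patch_normalize(target, eps=1e-6):
    """target: (N, L, patch_dim) flattened RGB patches.
    Standardize each patch to zero mean and unit std before computing the MSE."""
    mean = target.mean(dim=-1, keepdim=True)
    std = target.std(dim=-1, keepdim=True)
    return (target - mean) / (std + eps)
```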

Besides its simple loss function, MAE is designed to be computationally efficient during pre-training. By masking a high portion (e.g., 75%) of the image, the MAE encoder (a ViT) processes only a small portion of the patches. Thus, very large encoders (e.g., ViT-Huge) can be pre-trained with only a fraction of the compute and memory. Tab. 2 shows that feeding mask tokens to the encoder not only degrades performance but also increases the computational cost (FLOPs).

Table 2: An encoder without mask tokens is more accurate and faster.

On top of its efficient encoder, MAE leverages a lightweight decoder to reconstruct the masked pixels. The decoder processes the full set of tokens (visible and masked), so its efficiency is vital for MAE. Accordingly, the authors propose a shallow and thin decoder, i.e., a small number of blocks with a small embedding dimension, as shown in Tab. 3. MAE's decoder performs less than 10% of the per-token computation of the encoder.

Table 3: Quantitative evaluation for MAE decoders with different depths (number of blocks) and widths (embedding dimensions). A sufficiently deep decoder is important for linear probing (lin). The decoder depth is less influential for improving fine-tuning (ft). A thin decoder — with 512 embedding dimensions — works well for both linear probing and fine-tuning.
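For a sense of scale, the snippet below contrasts an encoder and a decoder configuration. The ViT-Large numbers come from the original ViT paper; the decoder depth and width follow the defaults discussed around Tab. 3, so treat the exact values as approximate rather than authoritative:

```python
# Encoder vs. decoder: asymmetric by design.
encoder_cfg = dict(depth=24, embed_dim=1024)  # ViT-Large; processes only the ~25% visible patches
decoder_cfg = dict(depth=8, embed_dim=512)    # shallow & thin; processes all patches + mask tokens

# The decoder is discarded after pre-training, so its capacity only matters
# for the reconstruction task, not for downstream recognition.
```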

Simplicity is a key feature of MAE. Instead of proposing a fancy masking strategy, the paper uses uniform random masking. In addition, the paper relies on simple augmentation (e.g., random resized cropping) during pre-training. Tab. 4 presents quantitative evaluations of these simple choices compared to other alternatives.

Table 4: Quantitative evaluations for various data augmentation techniques (left) and masking approaches (right). Random resized cropping along with uniform random masking achieves the best performance.
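To show how simple the augmentation pipeline is, here is a plausible torchvision version. The crop scale range and the normalization statistics are assumptions on my part, not values quoted from the paper; the point is that spatial cropping and flipping are the only augmentations in the best-performing setup of Tab. 4.

```python
from torchvision import transforms

# A plausible MAE-style pre-training pipeline: spatial augmentation only.
# The crop scale range and normalization statistics below are assumptions.
pretrain_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```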

By default, MAE masks 75% of the patches. Yet, Fig. 4 shows that MAE supports a wide range of masking ratios while maintaining strong performance.

Figure 4: Quantitative evaluation for various masking ratios. The y-axes are ImageNet-1K validation accuracy (%). A high masking ratio (75%) works well for both fine-tuning (top) and linear probing (bottom).

By masking a large portion of the input image, MAE achieves two goals: (1) it largely reduces redundancy and creates a challenging self-supervisory task that requires holistic understanding beyond low-level image statistics, (2) it reduces the wall-clock time to pre-train a given architecture for a given number of epochs as shown in Tab. 5.

Table 5: Wall-clock time of MAE pre-training (800 epochs), benchmarked on 128 TPU-v3 cores with TensorFlow. The speedup is relative to the entry whose encoder processes mask tokens (gray). The decoder width is 512, and the mask ratio is 75%. †: this entry is estimated from training for ten epochs.

MAE is quantitatively evaluated against SOTA pre-training methods (e.g., DINO) as shown in Tab. 6.

Table 6: Comparison with previous results on ImageNet-1K. The pre-training data is the ImageNet-1K training set (except that the tokenizer in BEiT was pre-trained on the 250M-image DALL·E dataset). All self-supervised methods are evaluated by end-to-end fine-tuning. The ViT models are B/16, L/16, and H/14. The best result in each column is underlined. All results use a 224 image size, except ViT-H, which has an extra result at 448. Here, MAE reconstructs normalized pixels and is pre-trained for 1600 epochs.

Fig. 5 compares MAE with fully-supervised pre-training on large datasets (e.g., JFT300M). It is worth noting that all these evaluations use end-to-end fine-tuning. Linear probing evaluations are reported for hyper-parameter tuning only, i.e., for choosing the decoder depth and width. Thus, the paper rarely reports linear probing comparisons against SOTA methods. I will return to this point at the end of the article.

Figure 5: MAE pre-training vs. supervised pre-training, evaluated by fine-tuning on ImageNet-1K (224 image size). MAE is compared with the original ViT results trained on IN1K or JFT300M.

Finally, the paper evaluates MAE pre-training on object detection in Tab. 7, semantic segmentation in Tab. 8, and transfer learning in Tab. 9. MAE achieves competitive performance on all these benchmarks. Further evaluations are reported in the paper.

Table 7: COCO object detection and segmentation evaluation using a ViT Mask R-CNN. Self-supervised entries use IN1K data without labels. Mask AP follows a similar trend as box AP.
Table 8: ADE20K semantic segmentation (mIoU) using UperNet. BEiT results are reproduced using the official code. Self-supervised entries use IN1K data without labels.
Table 9: Transfer learning accuracy on classification datasets, using MAE pre-trained on IN1K and then fine-tuned.

My Comments

  1. [S] This is a well-written and well-presented paper. The MAE approach is simple and a great starting point for anyone interested in self-supervised learning. Kudos to the authors for releasing the code and pre-trained checkpoints.
  2. [S] By masking 75% of the input image, MAE is computationally cheap, i.e., it fits on small GPUs. In addition, MAE’s loss function is independent of the batch size. So, MAE works with small batch sizes and there is no need to synchronize features/losses across GPUs, i.e., no need for distributed data-parallel tricks (e.g., gather/reduce).
  3. [W] Indeed, MAE is both computationally cheap and batch-size independent, which makes it an ideal approach for self-supervised learning. Unfortunately, MAE needs a large number of epochs during pre-training. The paper [1] uses 800 epochs by default and pre-trains some models for 1600 epochs. This large number of epochs compensates for the large masking ratio (e.g., 75%). If 75% of the patches are masked in every pre-training epoch, the model sees only 25% of the dataset’s patches within a single epoch. Accordingly, four (4) epochs are required to “see” the entire dataset once during pre-training. In other words, four (4) epochs of MAE are roughly equivalent to a single epoch of other pre-training approaches.
  4. [W] The proposed MAE [1] is strictly entangled with ViT models. Further modifications are required to enable MAE with CNN models. Fortunately, follow-up literature [2, 3] addresses this problem.
  5. [W] The proposed MAE [1] struggles with linear probing evaluations. Accordingly, the paper delivers fine-tuning evaluations only. Furthermore, the authors argue against linear probing evaluations citing [5] that “linear probing is not well correlated with transfer learning performance, e.g., for object detection.” Fortunately, a follow-up paper [4] proposes a contrastive loss term that boosts MAE’s linear probing performance.
  6. [W] I don’t think this is how humans (e.g., babies) learn. Yet, no one cares about this in 2023.

References

  1. Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
  2. ConvMAE: Masked Convolution Meets Masked Autoencoders.
  3. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders.
  4. A Simple, Efficient and Scalable Contrastive Masked Autoencoder for Learning Visual Representations.
  5. Masked Autoencoders Are Scalable Vision Learners.
