Sigmoid Loss for Language Image Pre-Training

Ahmed Taha
9 min read · Mar 18, 2024


Contrastive Language-Image Pre-training (CLIP) has gained significant momentum since OpenAI’s CLIP paper [2]. CLIP uses image-text pairs to pre-train a network with a contrastive loss. This approach has multiple advantages: (1) it is relatively cheap to collect image-text pair datasets by scraping the internet; (2) it enables zero-shot transfer to downstream tasks (e.g., image classification/retrieval); (3) its performance scales with the model and dataset sizes, i.e., bigger networks and datasets achieve better performance.

Figure 1: During training, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. During testing, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.

Unfortunately, CLIP comes with two technical challenges: (1) it requires a large batch size; for instance, CLIP [2] used a 32K batch size, which requires a lot of GPUs; (2) it requires a lot of communication between these GPUs. Concretely, both image and text features are gathered (all-gather) by all GPUs, which is a lot of communication given the large batch size required. Multiple papers have tackled the large batch-size requirement. For instance, MoCo [3,4,5] leverages an offline queue to reduce the batch-size requirement. Another approach, SimSiam [6], leverages a stop-gradient trick to eliminate negative samples. Accordingly, SimSiam features produced by one GPU are no longer transferred (all-gathered) across GPUs.

This paper [1] proposes yet another approach, SigLIP, to reduce CLIP’s batch-size requirement. SigLIP is short for Sigmoid Language-Image Pre-training. SigLIP’s key idea is to use a sigmoid operation instead of a softmax operation. CLIP uses a softmax function and, accordingly, the loss for a given positive (image, text) pair depends on every negative pair within the mini-batch, as shown in Eq. 1.
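
Written out (reconstructed here in LaTeX, using the notation defined below), Eq. 1 takes the form:

\mathcal{L}_{\text{softmax}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{e^{t\,x_i\cdot y_i}}{\sum_{j=1}^{N} e^{t\,x_i\cdot y_j}} + \log\frac{e^{t\,x_i\cdot y_i}}{\sum_{j=1}^{N} e^{t\,x_j\cdot y_i}}\right]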

Equation 1: CLIP uses a softmax operation. Accordingly, the similarity of every positive pair is normalized by *all* negative pairs. Thus, every GPU maintains an NxN matrix of all pairwise similarities, which brings quadratic memory complexity to CLIP.

where N denotes the batch size (number of positive pairs), x denotes the image features, y denotes the text features, and t is a scalar temperature hyperparameter that controls the sharpness/smoothness of the softmax output. There are two key details in Eq. 1: (1) the CLIP (softmax) loss is asymmetric with two terms: the first term finds the best text match for a given query image, while the second term finds the best image match for a given query text; (2) the CLIP (softmax) loss requires a global normalization factor (the denominator in Eq. 1), which introduces quadratic memory complexity, specifically an NxN pairwise similarity matrix.

In contrast, SigLIP is neither asymmetric nor does it require a global normalization factor. Accordingly, the loss for every pair (both positive and negative) is independent of all other pairs within the mini-batch, as shown in Eq. 2.
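
Written out (again reconstructed in LaTeX), Eq. 2 takes the form:

\mathcal{L}_{\text{sigmoid}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\log\frac{1}{1+e^{\,z_{ij}\,(-t\,x_i\cdot y_j\,-\,b)}}, \qquad z_{ij}=\begin{cases}+1 & i=j \\ -1 & i\neq j\end{cases}

where z_ij labels each pair (+1 for the matching pair, -1 otherwise) and b is a learnable bias term (ablated later in the paper).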

Equation 2: SigLIP uses a sigmoid operation, and each image-text pair (positive or negative) is evaluated independently. There is no need to maintain a global NxN normalization matrix. Accordingly, the SigLIP loss can be evaluated incrementally for large batch sizes.

It is worth noting that both CLIP and SigLIP compute the similarity between every pair (positive and negative) within a mini-batch. Yet, a subtle difference appears in the memory requirement of each loss. With CLIP, every GPU maintains an NxN matrix of all pairwise similarities in order to normalize the positive pairs. With SigLIP, there is no need to maintain the NxN matrix since every positive/negative pair is independent.

Another way to understand the difference between CLIP and SigLIP is to inspect their problem formulations. Given a query image I, CLIP solves a multi-class classification problem and assigns the image I to its corresponding positive text T out of all other negative texts within the mini-batch. In contrast, SigLIP solves a binary classification problem with a positive label for a matching pair (I, T) and a negative label for all other pairs. Accordingly, CLIP computes global normalization factors (the Eq. 1 denominator) while SigLIP doesn’t.
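To make the binary-classification view concrete, here is a minimal PyTorch sketch of the sigmoid loss on a single device (my own illustration rather than the authors’ pseudocode; it assumes L2-normalized embeddings and treats the temperature t and bias b as fixed scalars):

```python
import torch
import torch.nn.functional as F

def pairwise_sigmoid_loss(img_emb, txt_emb, t, b, positives_on_diagonal=True):
    """Sum of the per-pair binary terms (Eq. 2) for one (image chunk, text chunk) block."""
    logits = img_emb @ txt_emb.T * t + b                 # pairwise similarities, scaled and shifted
    labels = -torch.ones_like(logits)                    # every pair is a negative ...
    if positives_on_diagonal:                            # ... except the matching (i, i) pairs
        labels = labels + 2 * torch.eye(logits.size(0), device=logits.device)
    return -F.logsigmoid(labels * logits).sum()          # each pair is an independent binary classification

def siglip_loss(img_emb, txt_emb, t, b):
    """Full SigLIP loss on a single device: all NxN pairs, normalized by the batch size N."""
    return pairwise_sigmoid_loss(img_emb, txt_emb, t, b) / img_emb.size(0)
```

Because each pair is scored independently, the loss over any block of pairs can be computed without seeing the rest of the batch, which is exactly what the chunked implementation below exploits.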

Due to these differences, SigLIP requires less communication between GPUs than CLIP. CLIP passes both image and text features between all GPUs to compute the NxN normalization matrix, which costs two all-gather operations. In contrast, SigLIP passes only the text features between all GPUs to compute all pairwise similarities, which costs a single all-gather operation. Of course, a single all-gather operation is cheaper than two. Yet, an all-gather operation is still expensive because all GPUs stay idle while waiting to receive all features before computing the loss (Eq. 2). Imagine a mini-batch distributed across 256 GPUs; every GPU waits until it receives features from all other 255 GPUs before computing the loss. That is a lot of waiting time! So, the paper proposes an efficient “chunked” implementation to avoid all-gather altogether.

The efficient implementation performs both loss computation and feature communication incrementally. Since SigLIP operates on every image-text pair independently, its loss can be evaluated incrementally. Fig. 2 illustrates this idea using a toy setup: a mini-batch of size 12 distributed across 3 GPUs. In this example, there are 12 positive pairs and 132 negative (off-diagonal) pairs. Each GPU first computes the loss of its on-device mini-batch (size 4). Then, each GPU passes its text features to a single sibling GPU. In Fig. 2c, GPU (device) #1 receives text features from GPU #2, GPU #2 from GPU #3, and GPU #3 from GPU #1. Now, each GPU has a new set of negative pairs: its own image features paired with the text features from its sibling GPU. So, a new loss is computed and accumulated onto the previously computed loss. SigLIP repeats these two steps (loss computation and feature communication) until the loss over the entire mini-batch has been accumulated.

Figure 2: Efficient SigLIP demonstrated on a toy setup with 3 GPUs and a global batch size of 12. There are no all-gathers, and at any point in time only the bright yellow square (size 4 × 4) is materialized in memory.

The “chunked” implementation can be summarized as follows for each GPU (device); a single-process sketch follows the list:

  1. Compute the loss on its own image and text features.
  2. Receive text features from a single sibling GPU.
  3. Compute a new loss using its own image features and the sibling’s text features.
  4. Increment the total loss by the newly computed loss.
  5. Repeat from step #2 until every sibling GPU has passed along its text features.
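
Reusing the helpers from the earlier sketch, the chunked accumulation can be emulated in a single process (no actual inter-GPU communication; the ring exchange is simulated by indexing the next device’s text chunk, and equal chunk sizes are assumed):

```python
def chunked_siglip_loss(img_chunks, txt_chunks, t, b):
    """Emulate the chunked computation: 'device' d holds img_chunks[d] and, at each step,
    sees exactly one visiting text chunk, so only a chunk-sized similarity matrix is ever
    materialized. Positives only exist against a device's own text chunk (step 0)."""
    n_dev = len(img_chunks)
    total = 0.0
    for d in range(n_dev):                   # loop over devices (these run in parallel in reality)
        for step in range(n_dev):            # loop over ring-exchange steps
            src = (d + step) % n_dev         # whose text chunk is currently visiting device d
            total = total + pairwise_sigmoid_loss(
                img_chunks[d], txt_chunks[src], t, b,
                positives_on_diagonal=(step == 0))
    return total / (n_dev * img_chunks[0].size(0))   # normalize by the global batch size

# Toy check mirroring Fig. 2: a global batch of 12 split across 3 "devices".
torch.manual_seed(0)
imgs = F.normalize(torch.randn(12, 64), dim=-1)
txts = F.normalize(torch.randn(12, 64), dim=-1)
t, b = 10.0, -10.0                           # the paper reports initializing t = 10 and b = -10
print(torch.allclose(siglip_loss(imgs, txts, t, b),
                     chunked_siglip_loss(list(imgs.chunk(3)), list(txts.chunk(3)), t, b)))
# True: the chunked accumulation reproduces the full loss (up to floating-point rounding).
```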

In the experiments section, SigLIP (sigmoid) is evaluated against CLIP (softmax). After pre-training a model, zero-shot performance on ImageNet is reported. Fig. 3 compares SigLIP and CLIP using ImageNet zero-shot performance (y-axis) for different batch sizes (x-axis) during pre-training. The key findings are: (1) SigLIP achieves superior performance to CLIP with a small batch size (e.g., 4–8k); this is important because many researchers lack the computational budget (GPUs) for large batch sizes; (2) while the related literature claims large batch sizes boost performance, this paper [1] shows that both SigLIP and CLIP saturate at a 32k batch size; (3) as the batch size increases, the performance gap between SigLIP and CLIP diminishes.

Figure 3: SigLIP (sigmoid) vs. CLIP (softmax) quantitative evaluation using image and text encoders trained from scratch, i.e., randomly initialized encoders. The y-axis denotes ImageNet zero-shot performance while the x-axis denotes the training mini-batch size. SigLIP achieves superior performance to CLIP with a small batch size. Both SigLIP and CLIP saturate at a 32k batch size.

The authors of [1] have a previous paper [7] that also aims to reduce the cost of pre-training language-image models. In [7], Zhai et al. take advantage of pre-trained image encoders because training an image encoder from scratch is computationally expensive. Zhai et al. [7] load a pre-trained image encoder and lock its weights during pre-training. This method is called LiT, short for Locked-image Tuning. LiT has a significantly lower computational cost than vanilla CLIP because, during pre-training, only the text encoder’s gradients are computed. In other words, while CLIP computes both the forward and backward passes for the image encoder, LiT computes the forward pass only. Since LiT [7] was published before SigLIP, it was originally evaluated with the CLIP (softmax) loss. In [1], Zhai et al. compare LiT using both the CLIP and SigLIP losses (the sigmoid variant is called SigLiT), as shown in Fig. 4.
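As a rough illustration of the LiT setup (a sketch with placeholder modules standing in for the two towers, not the authors’ code):

```python
import torch
import torch.nn as nn

# Placeholder towers; in practice the image tower is a large pretrained encoder
# (e.g., a ViT) and the text tower is a transformer trained from scratch.
image_encoder = nn.Linear(2048, 512)   # stands in for the pretrained, locked image tower
text_encoder = nn.Linear(768, 512)     # stands in for the text tower

# LiT: lock the image tower so only its forward pass is ever computed.
image_encoder.requires_grad_(False)
image_encoder.eval()

# The optimizer (and hence the backward pass) only involves the text tower.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-4)
```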

Figure 4: SigLiT (green curve) vs. CLIP (orange curve) quantitative evaluation using a frozen (pretrained) image encoder, i.e., only the text encoder is trained from scratch. Again, the sigmoid loss achieves superior performance with a small batch size. As the batch size increases, the performance gap between the sigmoid and softmax losses diminishes.

The paper presents several experiments that are missing from this article. For instance, SigLIP is evaluated against other methods (CLIP, EVA-CLIP), and is ablated with various hyperparameters and optimizers. This article omits these experiments but concludes with one final interesting experiment. The paper evaluates which loss (CLIP vs. SigLIP) is more robust to label noise. This experiment is important because language-image training datasets are usually scraped from the internet. For practical purposes, it is assumed that there is a single text match for each image query and vice versa. This assumption is usually noisy and imperfect. While it is difficult to quantify the noise level in a training dataset, it is possible to corrupt the training data synthetically using one of the following five methods (a rough sketch of these corruptions follows the list):

  1. Image: With probability p, replace the image with uniform random noise.
  2. Text: With probability p, replace tokenized text with a new sequence of randomly sampled tokens, up to some (sampled) sequence length.
  3. Batch alignment: Randomly shuffle the ordering of p% of the batch.
  4. Image & text: Apply both Image and Text methods with probability p each.
  5. Image, text & batch: Alongside Image & text method, also shuffle fraction p of alignments.
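
A simplified sketch of how such corruptions could be applied (my own illustration; among other simplifications, the paper samples a new sequence length for corrupted captions, which is omitted here):

```python
import torch

def corrupt_batch(images, token_ids, p, mode, vocab_size=32_000):
    """Synthetically corrupt a batch (sketch). images: (N, C, H, W) floats in [0, 1];
    token_ids: (N, L) int64 tokenized captions. 'mode' selects the corruption type."""
    images, token_ids = images.clone(), token_ids.clone()
    N = images.size(0)
    if mode in ("image", "image+text", "image+text+batch"):
        mask = torch.rand(N) < p                        # corrupt each image with probability p
        images[mask] = torch.rand_like(images[mask])    # replace with uniform random noise
    if mode in ("text", "image+text", "image+text+batch"):
        mask = torch.rand(N) < p                        # corrupt each caption with probability p
        token_ids[mask] = torch.randint(vocab_size, token_ids[mask].shape)
    if mode in ("batch", "image+text+batch"):
        k = int(p * N)                                  # shuffle the alignment of p% of the batch
        idx = torch.randperm(N)[:k]
        token_ids[idx] = token_ids[idx[torch.randperm(k)]]
    return images, token_ids
```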

Fig. 5 compares CLIP (softmax) vs. SigLIP (sigmoid) when the training dataset is corrupted using one of the five aforementioned approaches. Clearly, SigLIP is consistently more robust to label noise. I don’t know how to explain this behavior, and the paper reports the result without any further discussion!

Figure 5: SigLIP (sigmoid) is more robust to data noise than CLIP (softmax). Titles show the type of corruption applied while x-axes show the probability with which noise is applied. With increasing corruption severity, models trained with the sigmoid loss retain their superiority over the corresponding softmax baselines.

My Comments:

[S1] The paper is well-organized and tackles an important problem: how to train large-scale deep networks efficiently with a limited computational budget (GPUs/TPUs). I highly recommend this paper to anyone interested in language-image/multi-modal pre-training.

[S2] The paper does a great job in the experiments section. It both evaluates SigLIP against many baselines (e.g., CLIP, OpenCLIP) and ablates SigLIP’s hyperparameters (the temperature and bias terms). Furthermore, the paper leverages SigLIP to identify important negative samples (easy vs. random vs. hard). While the conclusion of this experiment is expected, this paper is, to my knowledge, the first to perform this evaluation in a large-scale setup.

[S3] While both SigLIP and MoCo aim to reduce the batch-size requirement, SigLIP copes better with bigger batch sizes. MoCo’s authors abandoned the offline queue in MoCo v3 [5] because it brings diminishing returns beyond a 4k mini-batch. Furthermore, I think MoCo cannot leverage large batch sizes the way CLIP and SigLIP do: both clearly achieve better performance with bigger batch sizes up to 32k, while MoCo is limited by the offline nature of its queue. When the offline queue is too large, most of the queue’s embeddings are out-of-date (easy negatives) and contribute little to the training process.

[W1] This DeepMind team has published multiple papers [1,7,8] about how to train vision/language models efficiently. Basically, they have good recipes for efficient training (e.g., disabling weight decay during fine-tuning). In this paper, they use different optimizers (e.g., Lion, Adafactor), while AdamW is used for ablation studies only. I wish they had elaborated on these optimizer choices.

[W2] I don’t like that the authors release pseudo code for a single GPU only, which contradicts the paper’s objective, namely efficient training on multiple accelerators (TPUs/GPUs). Fortunately, Ross Wightman has released a distributed implementation of the sigmoid loss.

[S/W] From my practical experience with both CLIP and SigLIP, SigLIP achieves superior performance. Yet, the performance boost is not huge. It is hard to tell whether this boost stems from SigLIP’s superiority or from not tuning CLIP’s hyperparameters hard enough!

References:

[1] Zhai, Xiaohua, et al. “Sigmoid loss for language image pre-training.” ICCV 2023.

[2] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” ICML 2021.

[3] He, Kaiming, et al. “Momentum contrast for unsupervised visual representation learning.” CVPR 2020.

[4] Chen, Xinlei, et al. “Improved baselines with momentum contrastive learning.” arXiv 2020.

[5] Chen, Xinlei, Saining Xie, and Kaiming He. “An empirical study of training self-supervised vision transformers.” ICCV 2021.

[6] Chen, Xinlei, and Kaiming He. “Exploring simple siamese representation learning.” CVPR 2021.

[7] Zhai, Xiaohua, et al. “Lit: Zero-shot transfer with locked-image text tuning.” CVPR 2022.

[8] Kolesnikov, Alexander, et al. “Big transfer (BiT): General visual representation learning.” ECCV 2020.
