Cross-Iteration Batch Normalization

Given the batch-norm statistics (mean and std) at a previous iteration t-1, what are the corresponding batch-norm statistics at the current iteration t? The statistics change for the same image batch because the network weights change between iterations.
Using a Taylor polynomial, we compute f(2.1) given our knowledge of f(x), f’(x), and f(2)=4.
The Taylor polynomial approximates f(x+δ) from the function’s value and derivatives at a nearby point x. Accordingly, the estimates are accurate inside the green circle, e.g., f(2.1) with δ=0.1. Using the same polynomial to estimate f(0) gives f(0)=-4, which is wrong, because 0 is far from the expansion point 2, i.e., |δ|=2.
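The behaviour described above can be reproduced with a first-order Taylor polynomial. This is a minimal sketch that assumes f(x) = x², a choice made only because it is consistent with f(2)=4 and with the wrong estimate f(0)=-4; the function names are illustrative.

```python
def f(x):
    # Assumed example function (not stated in the post); satisfies f(2) = 4.
    return x * x


def f_prime(x):
    # Derivative of the assumed function.
    return 2 * x


def taylor1(x0, delta):
    """First-order Taylor estimate of f(x0 + delta), expanded around x0."""
    return f(x0) + f_prime(x0) * delta


near = taylor1(2.0, 0.1)   # estimate of f(2.1), small delta
far = taylor1(2.0, -2.0)   # estimate of f(0), large delta

print(f"f(2.1): estimate={near:.2f}, true={f(2.1):.2f}")
print(f"f(0):   estimate={far:.2f}, true={f(0.0):.2f}")
```

The small step lands within 0.01 of the true value, while the large step is off by 4, which is why CBN only reuses statistics from recent iterations where the weights have barely moved.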
A matrix multiplication using a toy example x = θy. The gradient of x with respect to θ has 2×2×2 = 8 elements, but half of them are always zero because each output x_i depends only on row i of θ. There is no need to compute those entries, so the gradient is cheaper than it first appears.
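The sparsity in this toy example can be checked numerically. The sketch below assumes θ is a 2×2 matrix and y a 2-vector with arbitrary illustrative values, and estimates every entry of the gradient with central differences.

```python
def matvec(theta, y):
    """Compute x = theta @ y for a 2x2 matrix and a 2-vector."""
    return [sum(theta[i][k] * y[k] for k in range(2)) for i in range(2)]


theta = [[1.0, 2.0], [3.0, 4.0]]  # illustrative values
y = [5.0, 7.0]                    # illustrative values
eps = 1e-6

# jac[i][j][k] approximates d x_i / d theta_jk.
jac = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
for j in range(2):
    for k in range(2):
        plus = [row[:] for row in theta]
        minus = [row[:] for row in theta]
        plus[j][k] += eps
        minus[j][k] -= eps
        xp, xm = matvec(plus, y), matvec(minus, y)
        for i in range(2):
            jac[i][j][k] = (xp[i] - xm[i]) / (2 * eps)

nonzero = sum(
    abs(jac[i][j][k]) > 1e-4
    for i in range(2) for j in range(2) for k in range(2)
)
print(f"non-zero gradient entries: {nonzero} of 8")
```

The only surviving entries are those with i = j, and each equals y_k, so a practical implementation can store the gradient in the same shape as y rather than as a full 2×2×2 tensor.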
Using the Taylor polynomial, CBN computes the batch-norm statistics at the current iteration using the batch-norm statistics at a previous iteration.
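To make the idea concrete, here is a toy sketch of the compensation step. It rests on several assumptions that are not in the post: a single layer x = θ·y with a scalar weight, statistics tracked as the mean μ and the mean of squares ν (with std derived from them), and a window of just two iterations averaged with equal weight. All names and numbers are hypothetical.

```python
import math


def stats(theta, ys):
    """Mean and mean-of-squares of the activations x = theta * y."""
    xs = [theta * y for y in ys]
    mu = sum(xs) / len(xs)
    nu = sum(x * x for x in xs) / len(xs)
    return mu, nu


def stat_grads(theta, ys):
    """Analytic d(mu)/d(theta) and d(nu)/d(theta) for x = theta * y."""
    mean_y = sum(ys) / len(ys)
    mean_y2 = sum(y * y for y in ys) / len(ys)
    return mean_y, 2 * theta * mean_y2


def compensate(theta_old, theta_new, ys_old):
    """First-order Taylor estimate of the old batch's statistics under the new weight."""
    mu_old, nu_old = stats(theta_old, ys_old)
    g_mu, g_nu = stat_grads(theta_old, ys_old)
    d = theta_new - theta_old
    return mu_old + g_mu * d, nu_old + g_nu * d


theta_prev, theta_now = 1.00, 1.05   # weight before and after one update
batch_prev = [0.5, 1.5, 2.0, 4.0]    # inputs seen at iteration t-1
batch_now = [1.0, 2.0, 3.0, 2.0]     # inputs seen at iteration t

mu_comp, nu_comp = compensate(theta_prev, theta_now, batch_prev)
mu_exact, nu_exact = stats(theta_now, batch_prev)   # what a re-forward pass would give
mu_naive, nu_naive = stats(theta_prev, batch_prev)  # stale statistics, no compensation

# Aggregate the current batch with the compensated previous batch.
mu_now, nu_now = stats(theta_now, batch_now)
mu_bar = (mu_now + mu_comp) / 2
nu_bar = (nu_now + max(nu_comp, mu_comp ** 2)) / 2  # keep the variance non-negative
sigma_bar = math.sqrt(nu_bar - mu_bar ** 2)

print(f"nu error, naive:       {abs(nu_naive - nu_exact):.4f}")
print(f"nu error, compensated: {abs(nu_comp - nu_exact):.4f}")
print(f"aggregated mean={mu_bar:.3f}, std={sigma_bar:.3f}")
```

In this linear toy layer the compensated mean matches the exact value, and the compensated second moment is much closer to the exact value than the stale one, without running the old batch through the network again.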
The gradient of the batch-norm statistics at layer l depends on the weights of all preceding layers, so it is computationally expensive. According to the CBN paper, the partial gradients with respect to earlier layers rapidly diminish. Motivated by this observation, the paper truncates these partial gradients and keeps only the one at layer l.
Instead of an exact gradient, an approximate gradient is used in the Taylor polynomial to reduce the computational complexity.
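Written out, the approximation replaces the gradient over all weights with the gradient over the weights of layer l only. The notation below is an assumption chosen for illustration (μ for the batch mean at layer l, θ for the weights, τ for how many iterations back the statistics were computed); the paper should be consulted for the exact form.

```latex
% Full first-order Taylor compensation (gradient w.r.t. all weights):
\mu^{l}_{t-\tau}(\theta_{t}) \approx
  \mu^{l}_{t-\tau}(\theta_{t-\tau})
  + \frac{\partial \mu^{l}_{t-\tau}(\theta_{t-\tau})}{\partial \theta_{t-\tau}}
    \left(\theta_{t} - \theta_{t-\tau}\right)

% Truncated version (gradient w.r.t. the weights of layer l only):
\mu^{l}_{t-\tau}(\theta_{t}) \approx
  \mu^{l}_{t-\tau}(\theta_{t-\tau})
  + \frac{\partial \mu^{l}_{t-\tau}(\theta_{t-\tau})}{\partial \theta^{l}_{t-\tau}}
    \left(\theta^{l}_{t} - \theta^{l}_{t-\tau}\right)
```

The same truncation would apply to whichever second-order statistic is tracked alongside the mean.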
The gradient of the batch-norm mean with respect to the network’s weights is mostly zero. This is illustrated in Appendix B of the paper.
Top-1 classification accuracy vs. batch size per iteration. The base model is a ResNet-18 trained on ImageNet. The accuracy of vanilla batch normalization (BN) drops rapidly with small batch sizes. Batch Renormalization (BRN) stabilizes BN a little but still suffers from small batch sizes. Group Normalization (GN) exhibits stable performance but underperforms BN at adequate batch sizes. Cross-Iteration Batch Normalization (CBN) compensates for the reduced batch size per GPU by exploiting approximated statistics from recent iterations (the temporal window size denotes how many recent iterations are used to compute the statistics). CBN shows relatively stable performance across batch sizes. Naive CBN, which directly uses statistics from recent iterations without compensation, does not work well.
Top-1 accuracy of normalization methods with different batch sizes using ResNet-18 as the base model on ImageNet.
Results of feature normalization methods on Faster R-CNN with FPN and ResNet-50 on COCO.
Comparison of theoretical memory, FLOPs, and practical training and inference speed between original BN, GN, and CBN in both training and inference on COCO.
  • The paper is well written and the code is released. It is good to see that MSR-Asia allowed the code release!
  • The authors delivered a ton of experiments and ablation studies in the paper. I highly recommend reading the paper to learn more about CBN.
  • In the ablation studies section, the authors make an interesting conjecture: vanilla BN might suffer from small batch sizes only in the later stages of training. If I understand this correctly, vanilla BN is useful in the early stage of training even with a small batch size, and only breaks in the later stage. I wish the authors had elaborated more on this.



Ahmed Taha


I write reviews on computer vision papers. Writing tips are welcomed.