Cross-Iteration Batch Normalization

Given the batch-norm statistics (mean and std) at a previous iteration t-1, what are the corresponding batch-norm statistics at the current iteration t? The batch-norm statistics change for the same image-batch because the network weights change between iterations.
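To make this concrete, here is a minimal sketch (a toy linear layer of my own, not the paper's setup) showing that the same batch produces different activation statistics once the weights are updated:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))                 # one fixed image-batch of features
theta_prev = rng.normal(size=(64, 128))       # layer weights at iteration t-1
theta_curr = theta_prev + 0.05 * rng.normal(size=(64, 128))  # weights after an SGD step

act_prev = x @ theta_prev                     # pre-activations under the old weights
act_curr = x @ theta_curr                     # pre-activations under the new weights

print("mean/std at t-1:", act_prev.mean(), act_prev.std())
print("mean/std at t  :", act_curr.mean(), act_curr.std())
# The statistics measured at t-1 are therefore stale estimates of those at t.
```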
Using a Taylor polynomial, we estimate f(2.1) given the value f(2)=4 and the derivative f'(2).
The Taylor polynomial approximates a function f(x+δ) given the function's value at a nearby point x. Accordingly, the Taylor polynomial estimates are accurate inside the green circle, e.g., f(2.1) with δ=0.1. If we use the Taylor polynomial to estimate f(0), we get f(0)≈-4, which is wrong. This happens because the gap between 0 and 2 is large, i.e., |δ|=2.
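The figure never names f, but the quoted numbers (f(2)=4 and the bad estimate f(0)≈-4) are consistent with f(x)=x², so the following sketch assumes that:

```python
def f(x):
    return x ** 2          # assumed form of f; it matches f(2) = 4 and the f(0) estimate of -4

def f_prime(x):
    return 2 * x

def taylor_first_order(x, delta):
    """First-order Taylor estimate of f(x + delta) around x."""
    return f(x) + f_prime(x) * delta

print(taylor_first_order(2.0, 0.1))    # 4.4, close to the true f(2.1) = 4.41 (small delta)
print(taylor_first_order(2.0, -2.0))   # -4.0, far from the true f(0) = 0 (large delta)
```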
A matrix multiplication operation using a toy example x = θy. While the gradient has 2×2×2=8 elements, half of these elements are always zero, so there is no need to compute them! The computational cost of the gradient is lower than expected.
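A quick way to see this sparsity is to write out the Jacobian ∂x_i/∂θ_jk for the 2×2 toy example: it equals y_k when i=j and zero otherwise. The values below are illustrative, not taken from the paper:

```python
import numpy as np

theta = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
y = np.array([5.0, 6.0])          # x = theta @ y, so x_i = sum_k theta_ik * y_k

jac = np.zeros((2, 2, 2))         # jac[i, j, k] = d x_i / d theta_jk
for i in range(2):
    for j in range(2):
        for k in range(2):
            jac[i, j, k] = y[k] if i == j else 0.0

print("non-zero entries:", np.count_nonzero(jac), "of", jac.size)  # 4 of 8
```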
Using a Taylor polynomial, CBN estimates the batch-norm statistics at the current iteration from the batch-norm statistics of previous iterations.
The exact gradient of the batch-norm statistics depends on the weights of all preceding layers, which makes it computationally expensive. According to the CBN paper, the partial gradients contributed by earlier layers diminish rapidly. Motivated by this observation, the paper truncates these partial gradients and keeps only the gradient with respect to layer l's own weights.
Instead of an exact gradient, an approximate gradient is used in the Taylor polynomial to reduce the computational complexity.
The gradient of the batch-norm mean with respect to the network's weights is mostly zero, as illustrated in Appendix B of the paper.
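Putting these pieces together, here is a schematic sketch (not the authors' released code; the function and variable names are mine) of how a stale mean from iteration t-τ could be compensated with a first-order Taylor term that uses only the gradient with respect to the current layer's own weights:

```python
import numpy as np

rng = np.random.default_rng(0)
C, P = 4, 6                               # channels, number of weights in this layer (toy sizes)
mu_prev = rng.normal(size=C)              # stale per-channel mean from iteration t - tau
grad_mu_prev = rng.normal(size=(C, P))    # d(mu_prev)/d(theta), w.r.t. this layer's weights only
theta_prev = rng.normal(size=P)           # this layer's weights at iteration t - tau
theta_curr = theta_prev + 0.01 * rng.normal(size=P)  # weights after a few SGD steps

def compensate_mean(mu_prev, grad_mu_prev, theta_curr, theta_prev):
    """First-order Taylor compensation of a stale mean:
    mu(theta_t) ~ mu(theta_{t-tau}) + d mu / d theta * (theta_t - theta_{t-tau})."""
    return mu_prev + grad_mu_prev @ (theta_curr - theta_prev)

mu_compensated = compensate_mean(mu_prev, grad_mu_prev, theta_curr, theta_prev)
print(mu_compensated)  # compensated estimate of the current-iteration mean
# CBN then averages such compensated statistics from the last k iterations (the
# temporal window) with the current batch's statistics before normalizing.
```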
Top-1 classification accuracy vs. batch sizes per iteration. The base model is a ResNet-18 trained on ImageNet. The accuracy of vanilla batch normalization (BN) drops rapidly with small batch sizes. Batch Renormalization (BRN) stabilizes BN a little but still suffers from small batch sizes. Group Normalization (GN) exhibits stable performance but underperforms BN at adequate batch sizes. Cross-Iteration Batch Normalization (CBN) compensates for the reduced batch size per GPU by exploiting approximated statistics from recent iterations (temporal window size denotes how many recent iterations are utilized for statistics computation). CBN shows relatively stable performance over different batch sizes. Naive CBN, which directly calculates statistics from recent iterations without compensation, is shown not to work well.
Top-1 accuracy of normalization methods with different batch sizes using ResNet-18 as the base model on ImageNet.
Results of feature normalization methods on Faster R-CNN with FPN and ResNet50 on COCO.
Comparison of theoretical memory, FLOPs, and practical training and inference speed of original BN, GN, and CBN on COCO.
  • The paper is well-written and the code is released. It is good to see that MSR-Asia allowed code release!
  • The authors delivered a ton of experiments and ablation studies in the paper. I highly recommend reading the paper to learn more about CBN.
  • In the ablation studies section, the authors make an interesting conjecture: vanilla BN might suffer from small batch sizes only in the later stages of training! If I understand this correctly, vanilla BN is useful at the early stage of training even with a small batch size; it then breaks down at the later stages. I wish the authors had elaborated more on this.

I write reviews on computer vision papers. Writing tips are welcomed.
