Understanding the Effective Receptive Field in Deep Convolutional Neural Networks
In deep networks, a receptive field, or field of view, is the region in the input space that affects the features of a particular layer, as shown in Fig. 1. The receptive field is important for understanding and diagnosing a network's performance. A deep network should be designed with a receptive field that covers the entire relevant image region, because the network is oblivious to regions outside its receptive field.
Different network architectures have different receptive fields. For shallow architectures, convolutional neural networks (CNNs) have a smaller receptive field than transformers, as shown in Fig. 2. In CNNs, the receptive field grows incrementally, one layer after another. In transformers, however, the receptive field spans all input tokens after a single attention layer. Yet, these receptive-field estimates are only theoretical!
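To make the CNN case concrete, here is a minimal sketch (my own, not from the paper) of the standard recursion for the theoretical receptive field of a stack of convolution layers; the layer specs below are purely illustrative:

```python
# Sketch of the theoretical receptive field (TRF) recursion for a conv stack:
#   r_l = r_{l-1} + (k_l - 1) * j_{l-1},   j_l = j_{l-1} * s_l
# where r is the TRF size, j the cumulative stride ("jump"), k the kernel size,
# and s the stride.

def theoretical_rf(layers):
    """layers: list of (kernel_size, stride) tuples, from input to output."""
    r, j = 1, 1           # a single input pixel initially sees only itself
    for k, s in layers:
        r += (k - 1) * j  # each layer widens the field by (k - 1) jumps
        j *= s            # stride multiplies the jump between samples
    return r

# Ten plain 3x3, stride-1 conv layers: the TRF grows only linearly (+2 per layer).
print(theoretical_rf([(3, 1)] * 10))   # -> 21
```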
In CNNs, the pixels at the center of a receptive field have a larger impact on the output. In the forward pass, the center pixels can propagate information to the output through many different paths, while boundary pixels have very few paths to propagate their values, as shown in Fig. 3. Therefore, during the backward pass, the center pixels receive a much larger gradient magnitude from that output. In [1], Luo et al. evaluate the receptive field in CNNs empirically and coin the term effective receptive field (ERF).
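The path-counting argument can be illustrated with a tiny 1D experiment (my own sketch, not the authors' code): with all-ones 3-tap kernels, the number of forward paths from each input position to the single center output is the repeated convolution of [1, 1, 1] with itself, which quickly takes a bell shape:

```python
# Toy 1D illustration of the path-counting argument: after n all-ones 3-tap
# conv layers, the path counts form the n-fold convolution of [1, 1, 1] with
# itself, which approaches a Gaussian (central limit theorem).
import numpy as np

paths = np.array([1.0])
for _ in range(10):                      # ten 3-tap conv layers
    paths = np.convolve(paths, [1.0, 1.0, 1.0])

print(paths / paths.max())               # bell-shaped: ~1 at the center, orders of
                                         # magnitude smaller at the boundaries
```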
Luo et al. [1] show that the ERF both follows a Gaussian distribution and occupies only a fraction of the full theoretical receptive field (TRF). To evaluate the ERF, the paper computes the gradient of the output feature map w.r.t. a given input. To compute the ERF, the multi-dimensional output feature map is reduced to a scalar using a Dirac delta, as shown in Fig. 4. In other words, the ERF is computed from the center pixel of the output feature map.
While the TRF depends only on the architecture, the ERF dy/dx depends on the input, i.e., different inputs generate different ERFs. Thus, a single ERF computation is not enough. Accordingly, Luo et al. [1] average the ERF over 20 runs (inputs). Initially, the paper evaluates the ERF using untrained networks, initialized either uniformly (all ones) or randomly. Once initialized, these networks are kept fixed while the average ERF is computed over the 20 runs (inputs).
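A minimal PyTorch sketch of this measurement procedure might look as follows. This is my reconstruction rather than the authors' code, and the toy network, input size, and number of runs are placeholders:

```python
# Measure the ERF: place a Dirac delta on the center output unit, backpropagate,
# and average the absolute input gradient |dy/dx| over random inputs.
import torch
import torch.nn as nn

net = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1) for _ in range(10)])  # toy 10-layer CNN

erf = torch.zeros(64, 64)
n_runs = 20
for _ in range(n_runs):
    x = torch.randn(1, 1, 64, 64, requires_grad=True)
    y = net(x)
    grad_out = torch.zeros_like(y)
    grad_out[0, 0, y.shape[2] // 2, y.shape[3] // 2] = 1.0   # Dirac delta at the center
    y.backward(grad_out)
    erf += x.grad[0, 0].abs()
erf /= n_runs   # the averaged gradient map is the effective receptive field
```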
Fig. 5 shows perfect Gaussian shapes for uniformly and randomly initialized convolution kernels without nonlinear activations. The figure also shows near-Gaussian shapes for randomly weighted kernels with a ReLU nonlinearity. Adding the ReLU makes the distribution a bit less Gaussian: ReLU outputs exactly zero for half of its inputs, so it is easy to get a zero output, which means fewer paths from the receptive field reach the output.
Fig. 6 shows the ERF for randomly initialized 20-layer networks. The three networks use different nonlinearities: ReLU, Tanh, and Sigmoid. The ERF is averaged over 100 runs (inputs) with different random weights as well as different random inputs. In this setting, the ERF looks a lot more Gaussian-like.
In the computer vision literature, it is typical to use subsampling to reduce the feature resolution as the network depth increases. The paper highlights the importance of both subsampling and dilated convolutions for increasing the ERF. Fig. 7 shows how subsampling and dilated convolution increase the ERF significantly; the sketch below gives the theoretical intuition.
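Extending the earlier TRF recursion with stride and dilation (again an illustrative sketch, not the paper's code) shows why both tricks grow the field much faster than plain convolutions: a dilated kernel behaves like an enlarged kernel of size d*(k-1)+1, and each stride multiplies all subsequent growth.

```python
# TRF recursion with stride and dilation (illustrative layer specs only).

def theoretical_rf(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, input to output."""
    r, j = 1, 1
    for k, s, d in layers:
        r += d * (k - 1) * j   # dilation enlarges the effective kernel
        j *= s                 # subsampling multiplies the jump
    return r

print(theoretical_rf([(3, 1, 1)] * 10))  # plain 3x3 stack        -> 21
print(theoretical_rf([(3, 2, 1)] * 10))  # stride-2 subsampling   -> 2047
print(theoretical_rf([(3, 1, 2)] * 10))  # dilation-2 convolutions -> 41
```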
Fig. 8 evaluates the receptive field before and after training CNNs. The effective receptive field grows significantly after training. In the CIFAR experiment (Fig. 8, left), the theoretical receptive field is 74x74, i.e., bigger than the 32x32 input image. Yet, the ERF still does not cover the input image.
Finally, the paper concludes with an informal relationship between the effective and theoretical receptive fields. The paper fits a line between the number of layers (x-axis) and the ERF ratio (y-axis). Fig. 9 shows that the ERF ratio (ERF/TRF) decreases with a slope of -0.43 as the number of layers grows. This finding is informal because it is architecture dependent: while a negative slope is always expected, the slope value will differ from one architecture to another.
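This is consistent with the paper's theoretical analysis: for n identical convolution layers, the TRF side grows linearly, TRF(n) = Θ(n), while the ERF side grows only as Θ(√n), so the ratio ERF/TRF shrinks as Θ(1/√n). A negative slope is therefore expected for any depth-stacked convolutional architecture, even though the fitted value (-0.43 here) is specific to the architectures studied.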
My comments:
- [S] This is a nice paper and especially important for fields that process high-resolution inputs.
- [S] The paper delivers impressive quantitative evaluations.
- [S] The paper also proposes an initialization scheme to increase the receptive field. While this scheme speeds up convergence by 30%, the authors admit that the overall benefit is not significant. Praise the honest authors :)
- [W] I find the paper’s mathematical formulation confusing. The paper’s idea is simple and could have been presented in a simpler way.
- To increase the receptive field of CNNs, the paper uses downsampling and dilation. Deformable convolution [2] is a more recent alternative for increasing the ERF.
- I am not aware of any paper that similarly evaluates the effective receptive field of Transformers. While a single attention layer can cover the entire input signal (all tokens), this holds only in theory. Accordingly, a paper that quantifies the receptive field of either vanilla Transformers or CvT would be interesting.
References:
[1] Luo, W., Li, Y., Urtasun, R. and Zemel, R., 2016. Understanding the effective receptive field in deep convolutional neural networks. Advances in neural information processing systems, 29.
[2] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H. and Wei, Y., 2017. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 764–773).