Why Does Unsupervised Pre-training Help Deep Learning?

This (G)old paper[1] tackles an interesting question: why does unsupervised pre-training help deep learning? The authors support their conclusion with a ton of experiments. Yet the findings contradict a common belief about unsupervised learning, which is why I have mixed feelings about this paper. I will present the paper first, then follow up with my comments.

The authors strive to understand how unsupervised pretraining helps. There are two main hypotheses:

  1. Better optimization: Unsupervised pretraining puts the network in a region of parameter space where basins of attraction run deeper than when starting with random parameters. In simple words, the network starts near a global minimum. In contrast to a local minimum, a global minimum means a lower training error.
  2. Better regularization: Unsupervised pretraining puts the network in a region of parameter space in which training error is not necessarily better than when starting at random (or possibly worse), but which systematically yields better generalization (lower test error). Such behavior would be indicative of a regularization effect.

The paper leans towards the 2nd hypothesis, i.e., unsupervised pretraining is a regularization technique. The next figure presents this finding by comparing the training error and test error, with and without unsupervised pretraining. This is a 2010 paper; it uses a three-layer fully connected network for evaluation. The MNIST dataset is used in most experiments.

Figure 1: Evolution without pre-training (blue) and with pre-training (red) on MNIST of the log of the test NLL plotted against the log of the train NLL as training proceeds. Each of the 2x400 curves represents a different initialization. The errors are measured after each pass over the data. The rightmost points were measured after the first pass of gradient updates. Since training error tends to decrease during training, the trajectories run from the right (high training error) to left (low training error). Trajectories moving up (as we go leftward) indicate a form of overfitting. All trajectories are plotted on top of each other.

In the right-most plot, notice how the training error (x-axis) is lower without pre-training, yet the test error (y-axis) is lower with pre-training. This contradicts the better-optimization hypothesis, which predicts that pre-training would achieve a lower training error (run deeper into a global minimum).
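
To make the reading of Figure 1 concrete, here is a minimal sketch of how one might compute the NLL and flag the "trajectory moving up as we go leftward" pattern. It is only an illustration: `mean_nll`, `overfitting_steps`, and the numbers in `toy_trajectory` are hypothetical and are not taken from the paper.

```python
import math

def mean_nll(probs, labels):
    """Average negative log-likelihood of the true class.

    probs[i][k] is the predicted probability of class k for example i.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def overfitting_steps(trajectory):
    """Return the epochs where train NLL fell while test NLL rose.

    trajectory is a list of (train_nll, test_nll) pairs, one per epoch.
    """
    flagged = []
    for i in range(1, len(trajectory)):
        prev_train, prev_test = trajectory[i - 1]
        train, test = trajectory[i]
        if train < prev_train and test > prev_test:
            flagged.append(i)
    return flagged

# Made-up (train NLL, test NLL) values, one pair per epoch.
toy_trajectory = [(0.50, 0.55), (0.30, 0.40), (0.10, 0.38), (0.02, 0.45)]
print(overfitting_steps(toy_trajectory))  # prints [3]
```

In this toy run, only the last epoch lowers the training NLL while raising the test NLL, which is the overfitting signature described in the caption.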

Then, the paper argues that “Not all regularizers are created equal.” The unsupervised pre-training regularizer is much better than the canonical L1/L2 regularizers. This is because the effectiveness of a canonical regularizer decreases as the data set grows, whereas the effectiveness of unsupervised pre-training as a regularizer is maintained as the data set grows. The next figure shows that as the dataset size increases (x-axis), the test error (y-axis) keeps decreasing with unsupervised pretraining.

Figure 2: Comparison between 1 and 3-layer networks trained on InfiniteMNIST. Online classification error, computed as an average over a block of the last 100,000 errors. To highlight that not all regularizers are created equal, three settings are used: without pretraining, with RBM pretraining, with denoising pretraining.
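
The claim that a canonical penalty fades as the data set grows can be seen in a toy ridge regression. This is a minimal sketch under stated assumptions, not an experiment from the paper: a single scalar weight, noiseless data, and a fixed L2 penalty. The names `ridge_slope`, `true_w`, and `lam` are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of sum((w*x - y)**2) + lam * w**2 for a scalar w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

true_w = 2.0  # hypothetical ground-truth slope
lam = 10.0    # hypothetical penalty strength, kept fixed as the data grows

gaps = []
for n in (10, 100, 1000, 10000):
    xs = [1.0] * n                 # noiseless toy data: y = true_w * x
    ys = [true_w * x for x in xs]
    w = ridge_slope(xs, ys, lam)
    gaps.append(true_w - w)        # how far the penalty pulls w toward zero
    print(n, round(w, 4))
```

The data term grows with the number of samples while the penalty stays constant, so the shrinkage toward zero vanishes as n increases. That is the sense in which an L2 penalty stops mattering for large data sets.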

Finally, the paper quantifies the impact of training samples’ order on the network output variance. High variance indicates that the order of the training samples significantly impacts the optimization problem. High variance is bad; a network should converge to similar solutions if trained on the same dataset and from the same random initialization. The trained network should be independent of the samples' order during training.

The next figure shows that this is not the case. Early training samples influence the output of the networks more than later ones. However, this variance is lower for the pretrained networks. Finally, both networks (with and without pretraining) are also strongly influenced by the very last examples used for optimization. This is simply because they are trained with stochastic gradient descent at a constant learning rate, where the most recent examples’ gradients have a greater influence.

Figure 3: Variance of the output of a trained network with 1 layer. The variance is computed as a function of the point at which we vary the training samples.
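
The recency effect of a constant learning rate can be reproduced in a few lines. The sketch below is a toy under stated assumptions, not the paper's setup: it fits a single scalar with one SGD pass, then measures the variance of the final value when only the sample at one position changes between runs. `sgd_constant_lr` and `output_variance` are hypothetical names. Because the problem is convex, it only captures the "last examples matter more" part; the extra influence of early examples reported in the paper is a property of deep, non-convex networks and does not show up here.

```python
import random
import statistics

def sgd_constant_lr(targets, lr=0.1, w0=0.0):
    """Fit one scalar w with a single SGD pass on the loss 0.5 * (w - y)**2."""
    w = w0
    for y in targets:
        w -= lr * (w - y)  # gradient step with a constant learning rate
    return w

def output_variance(base_targets, position, n_runs=20, lr=0.1, seed=0):
    """Variance of the final w when only the sample at `position` differs between runs."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_runs):
        targets = list(base_targets)
        targets[position] = rng.gauss(0.0, 1.0)  # swap in a fresh sample here
        finals.append(sgd_constant_lr(targets, lr))
    return statistics.pvariance(finals)

stream_rng = random.Random(42)
stream = [stream_rng.gauss(0.0, 1.0) for _ in range(50)]

var_first = output_variance(stream, position=0)
var_middle = output_variance(stream, position=25)
var_last = output_variance(stream, position=49)
print(var_first, var_middle, var_last)
```

With a constant learning rate, each update multiplies the contribution of all earlier samples by (1 - lr), so the variance grows as the varied position moves toward the end of the stream.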

My comments

  • The paper provides a ton of experiments, but this article provides just a peek.
  • The paper is old, even for me :) I needed multiple sessions to finish reading it. I am not familiar (for now) with Restricted Boltzmann Machines, which are used in the experiments. However, I managed to read the paper and picked up a lot of interesting insights.
  • Most unsupervised learning papers that use CNNs visualize the filters of the first conv layer, as shown in the next figure. The figure shows that unsupervised pretraining learns V1-like filters given unlabeled data. These filters look like edge and blob detectors (top three rows). A global minimum solution would have V1-like filters like these. Accordingly, unsupervised pretraining is more than just a regularizer. These filters give the impression that unsupervised pretraining puts a network closer to a region of parameter space where basins of attraction run deeper. In simple words, unsupervised learning puts a network nearer to a global minimum.
Figure 4: Conv1 filters visualization [2]. (a) The filters of the first convolutional layer of a pretrained network. (b) By fine-tuning the unsupervised pre-trained network on a labeled dataset, we obtain sharper filters.
  • Finally, a recent CVPR 2020 paper[3] reports that the benefit of unsupervised pretraining diminishes as the size of the dataset increases. This contradicts the findings of this paper[1] (Figure 2 in this article).

[1] Why Does Unsupervised Pre-training Help Deep Learning?

[2] Unsupervised Learning of Visual Representations using Videos

[3] How Useful Is Self-Supervised Pretraining for Visual Tasks?

I write reviews on computer vision papers. Writing tips are welcome.