Why Does Unsupervised Pre-training Help Deep Learning?

This (G)old paper [1] tackles an interesting question: Why Does Unsupervised Pre-training Help Deep Learning? The authors support their conclusion with a ton of experiments, yet the findings contradict a common belief about unsupervised learning. That’s why I have mixed feelings about this paper. I will present the paper first, then follow up with my comments.

The authors strive to understand how unsupervised pre-training helps. There are two main hypotheses:

  1. Better optimization: Unsupervised pre-training puts the network in a region of parameter space where basins of attraction run deeper than when starting with random parameters. In simple words, the network starts near a deep (close to global) minimum; compared to the shallower local minima reached from a random start, this means a lower training error.
  2. Better regularization: Unsupervised pretraining puts the network in a region of parameter space in which training error is not necessarily better than when starting at random (or possibly worse), but which systematically yields better generalization (lower test error). Such behavior would be indicative of a regularization effect.

The paper leans towards the second hypothesis, i.e., unsupervised pre-training is a regularization technique. The next figure presents this finding by comparing the training error and test error, with and without unsupervised pre-training. This is a 2010 paper; it evaluates a three-layer fully connected network, and the MNIST dataset is used in most experiments.

Figure 1: Evolution without pre-training (blue) and with pre-training (red) on MNIST of the log of the test NLL plotted against the log of the train NLL as training proceeds. Each of the 2x400 curves represents a different initialization. The errors are measured after each pass over the data. The rightmost points were measured after the first pass of gradient updates. Since training error tends to decrease during training, the trajectories run from the right (high training error) to left (low training error). Trajectories moving up (as we go leftward) indicate a form of overfitting. All trajectories are plotted on top of each other.

In the right-most panel, notice how the training error (x-axis) is lower without pre-training, yet the test error (y-axis) is lower with pre-training. This contradicts the better-optimization hypothesis, which assumes that pre-training would achieve a lower training error (run deeper into a global minimum).
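To make the protocol concrete, below is a minimal PyTorch sketch of the kind of setup the paper studies: greedy layer-wise unsupervised pre-training (one denoising auto-encoder per layer) followed by supervised fine-tuning of the whole stack. This is my own illustration, not the authors' code; the random tensors stand in for MNIST, and the layer sizes, masking rate, learning rate, and epoch counts are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-in for MNIST: 1,024 fake 28x28 images flattened to 784-dim vectors.
X = torch.rand(1024, 784)
y = torch.randint(0, 10, (1024,))

sizes = [784, 400, 400, 400]                           # input + three hidden layers
layers = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(3)]

# --- Unsupervised phase: greedy layer-wise denoising auto-encoder pre-training ---
h = X
for layer in layers:
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=0.1)
    for _ in range(10):                                # a few passes per layer
        noisy = h * (torch.rand_like(h) > 0.25).float()  # randomly mask 25% of the inputs
        recon = torch.sigmoid(decoder(torch.sigmoid(layer(noisy))))
        loss = nn.functional.mse_loss(recon, h)        # reconstruct the clean input
        opt.zero_grad(); loss.backward(); opt.step()
    h = torch.sigmoid(layer(h)).detach()               # codes become the next layer's input

# --- Supervised phase: fine-tune the pre-trained stack with a classifier on top ---
model = nn.Sequential(layers[0], nn.Sigmoid(),
                      layers[1], nn.Sigmoid(),
                      layers[2], nn.Sigmoid(),
                      nn.Linear(400, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(10):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The "without pre-training" baseline is simply the same supervised phase run on freshly initialized layers, which is the comparison the figure above plots.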

Then, the paper argues that “Not all regularizers are created equal.” The unsupervised pre-training regularizer is much better than the canonical L1/L2 regularizers, because the effectiveness of a canonical regularizer decreases as the dataset grows, whereas the effectiveness of unsupervised pre-training as a regularizer is maintained as the dataset grows. The next figure shows that as the dataset size increases (x-axis), the test error (y-axis) keeps decreasing with unsupervised pre-training.

Figure 2: Comparison between 1-layer and 3-layer networks trained on InfiniteMNIST. Online classification error, computed as an average over the last block of 100,000 errors. To highlight that not all regularizers are created equal, three settings are used: no pre-training, RBM pre-training, and denoising auto-encoder pre-training.
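A rough sketch of how such a comparison could be run: train the same architecture either from a random initialization with L2 weight decay (the canonical regularizer) or from pre-trained weights with no explicit regularizer, for growing training-set sizes, and track the test error. The `pretrained_weights` state dict is hypothetical (it would come from an unsupervised phase like the one sketched earlier), and the data, sizes, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

def make_mlp():
    # Same 784-400-400-400-10 architecture for both settings (placeholder sizes).
    return nn.Sequential(nn.Linear(784, 400), nn.Sigmoid(),
                         nn.Linear(400, 400), nn.Sigmoid(),
                         nn.Linear(400, 400), nn.Sigmoid(),
                         nn.Linear(400, 10))

def train(model, X, y, weight_decay=0.0, epochs=20):
    # weight_decay implements the canonical L2 regularizer; 0 disables it.
    opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=weight_decay)
    for _ in range(epochs):
        loss = nn.functional.cross_entropy(model(X), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

def test_error(model, X, y):
    with torch.no_grad():
        return (model(X).argmax(dim=1) != y).float().mean().item()

# Toy data standing in for growing subsets of (Infinite)MNIST.
X_test, y_test = torch.rand(2000, 784), torch.randint(0, 10, (2000,))
for n in [1000, 5000, 25000]:
    X, y = torch.rand(n, 784), torch.randint(0, 10, (n,))

    l2_model = train(make_mlp(), X, y, weight_decay=1e-4)   # random init + L2 penalty

    pretrained = make_mlp()
    # pretrained.load_state_dict(pretrained_weights)        # hypothetical: weights from
    #                                                       # the unsupervised phase above
    pretrained = train(pretrained, X, y)                    # pre-trained init, no L2

    print(n, test_error(l2_model, X_test, y_test), test_error(pretrained, X_test, y_test))
```

The paper's claim is that the gap in the second column shrinks as n grows if the regularizer is L1/L2, but persists if the "regularizer" is the pre-trained initialization.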

Finally, the paper quantifies the impact of the training samples’ order on the variance of the network output. High variance indicates that the order of the training samples significantly impacts the solution found by optimization. High variance is bad: a network trained on the same dataset from the same random initialization should converge to similar solutions, independently of the order in which the samples are presented.

The next figure shows that this is not the case. Early training samples influence the output of the networks more than later ones, although this variance is lower for the pre-trained networks. There is also a secondary effect: both networks (with and without pre-training) are strongly influenced by the very last examples used for optimization, simply because they use stochastic gradient descent with a constant learning rate, so the most recent examples’ gradients have a greater influence.

Figure 3: Variance of the output of a trained 1-layer network. The variance is computed as a function of the point in the training sequence at which the training samples are varied.
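Here is a rough sketch of how this output variance could be estimated: train several copies of a one-layer network from the same seeded initialization, resample only the examples at position k of the training stream, and measure the variance of the outputs at fixed probe points as a function of k. This is an approximation of the paper's procedure with made-up data and hyperparameters; the pre-trained variant would simply start from pre-trained weights instead of the seeded random initialization.

```python
import torch
import torch.nn as nn

def train_one_layer_net(batches, seed=0):
    # One hidden layer, trained online with one SGD step per mini-batch.
    torch.manual_seed(seed)                     # identical random init for every run
    model = nn.Sequential(nn.Linear(784, 400), nn.Sigmoid(), nn.Linear(400, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for X, y in batches:
        loss = nn.functional.cross_entropy(model(X), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# Toy stand-in for an ordered training stream, split into 10 blocks.
blocks = [(torch.rand(256, 784), torch.randint(0, 10, (256,))) for _ in range(10)]
X_probe = torch.rand(100, 784)                  # fixed points at which outputs are compared

for k in range(len(blocks)):                    # position in the stream being varied
    outputs = []
    for trial in range(5):                      # resample only block k, keep the rest fixed
        g = torch.Generator().manual_seed(1000 * k + trial)
        varied = list(blocks)
        varied[k] = (torch.rand(256, 784, generator=g),
                     torch.randint(0, 10, (256,), generator=g))
        model = train_one_layer_net(varied)
        with torch.no_grad():
            outputs.append(model(X_probe))
    var_k = torch.stack(outputs).var(dim=0).mean().item()
    print(f"block {k}: output variance {var_k:.5f}")
```

Plotting `var_k` against k gives a curve like the one in the figure: the variance is highest when the varied block sits at the beginning (or very end) of the stream.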

My comments

Figure 4: Conv1 filters visualization [2]. (a) The filters of the first convolutional layer of a pre-trained network. (b) By fine-tuning the unsupervised pre-trained network on a labeled dataset, we obtain sharper filters.

[1] Erhan et al., “Why Does Unsupervised Pre-training Help Deep Learning?”, JMLR 2010.

[2] Wang and Gupta, “Unsupervised Learning of Visual Representations using Videos”, ICCV 2015.

[3] Newell and Deng, “How Useful Is Self-Supervised Pretraining for Visual Tasks?”, CVPR 2020.

I write reviews on computer vision papers. Writing tips are welcome.