Why Does Unsupervised Pre-training Help Deep Learning?

  1. Better optimization: Unsupervised pretraining puts the network in a region of parameter space where basins of attraction run deeper than when starting from random parameters. In simple words, the network starts near a better minimum of the training objective, and a deeper minimum means a lower training error.
  2. Better regularization: Unsupervised pretraining puts the network in a region of parameter space in which training error is not necessarily better than when starting at random (or possibly worse), but which systematically yields better generalization (lower test error). Such behavior would be indicative of a regularization effect.
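Both hypotheses concern the same two-phase recipe: train a layer unsupervised (e.g., as an autoencoder), then fine-tune with labels. The following is a minimal numpy sketch of that recipe on toy data — a single tied-weight autoencoder layer plus a logistic-regression head. The data, sizes, and learning rate are illustrative assumptions, not the paper's setup, and fine-tuning here updates only the classifier head for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy stand-in for MNIST: 200 samples, 20 features, binary labels.
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(float)

n_hidden, lr = 16, 0.1

# --- Phase 1: unsupervised pretraining (tied-weight autoencoder) ---
W = rng.normal(scale=0.1, size=(20, n_hidden))
b = np.zeros(n_hidden)          # encoder bias
c = np.zeros(20)                # decoder bias
for _ in range(200):
    H = sigmoid(X @ W + b)      # encode
    X_hat = H @ W.T + c         # decode (weights tied to the encoder)
    err = X_hat - X             # reconstruction error
    dH = (err @ W) * H * (1 - H)
    # Gradient of W has an encoder path (X.T @ dH) and a decoder path (err.T @ H).
    W -= lr / len(X) * (X.T @ dH + err.T @ H)
    b -= lr * dH.mean(axis=0)
    c -= lr * err.mean(axis=0)

# --- Phase 2: supervised fine-tuning (logistic regression on the learned features) ---
v, v0 = np.zeros(n_hidden), 0.0
for _ in range(500):
    H = sigmoid(X @ W + b)
    p = sigmoid(H @ v + v0)
    g = p - y                   # cross-entropy gradient
    v -= lr * H.T @ g / len(X)
    v0 -= lr * g.mean()

acc = ((sigmoid(sigmoid(X @ W + b) @ v + v0) > 0.5) == y).mean()
```

The point of the sketch is only the structure: the unsupervised phase chooses the initialization of `W`, and the debate in the paper is about whether that initialization mainly helps optimization or mainly acts as a regularizer.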
Figure 1: Evolution without pre-training (blue) and with pre-training (red) on MNIST of the log of the test NLL plotted against the log of the train NLL as training proceeds. Each of the 2x400 curves represents a different initialization. The errors are measured after each pass over the data. The rightmost points were measured after the first pass of gradient updates. Since training error tends to decrease during training, the trajectories run from right (high training error) to left (low training error). Trajectories that move up as they move left indicate a form of overfitting. All trajectories are plotted on top of each other.
Figure 2: Comparison between 1- and 3-layer networks trained on InfiniteMNIST. Online classification error, computed as an average over a block of the last 100,000 errors. To highlight that not all regularizers are created equal, three settings are used: no pretraining, RBM pretraining, and denoising pretraining.
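The denoising variant compared above differs from RBM pretraining only in its unsupervised objective: the input is corrupted and the network is trained to reconstruct the clean version. A minimal sketch of the masking-noise corruption step (the function name and corruption fraction `p` are my own choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(X, p=0.25, rng=rng):
    """Masking noise: zero out a random fraction p of the input entries.
    A denoising autoencoder is trained to reconstruct X from corrupt(X)."""
    mask = rng.random(X.shape) >= p
    return X * mask

X = rng.normal(size=(4, 6))
X_tilde = corrupt(X)  # pass X_tilde through the encoder, reconstruct X
```

Because the reconstruction target is the clean input, the encoder cannot simply copy its input and must capture statistical structure in the data, which is why denoising pretraining behaves like a distinct regularizer.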
Figure 3: Variance of the output of a trained 1-layer network. The variance is computed as a function of the point during training at which the training samples are varied.
  • The paper provides a ton of experiments; this article offers just a peek.
  • The paper is old, even for me :) I needed multiple sessions to finish reading it. I am not familiar with Restricted Boltzmann Machines (for now), which are used in the experiments. Still, I managed to read the paper and learn a lot of interesting insights.
  • Most unsupervised learning papers that use CNNs visualize the filters of the first conv layer, as shown in the next figure. The figure shows that unsupervised pretraining learns V1-like filters from unlabeled data. These filters look like edge and blob detectors (top three rows). A global-minimum solution would be expected to contain V1-like filters like these. Accordingly, unsupervised pretraining is more than just a regularizer. These filters give the impression that unsupervised pretraining puts a network closer to a region of parameter space where basins of attraction run deeper. In simple words, unsupervised learning may put a network nearer to a global minimum.
Figure 4: Conv1 filters visualization [2]. (a) The filters of the first convolutional layer of a pretrained network. (b) By fine-tuning the unsupervised pre-trained network on a labeled dataset, we obtain sharper filters.
  • Finally, a recent CVPR 2020 paper [3] found that the benefit of unsupervised pretraining diminishes as the size of the dataset increases. This contradicts the findings of this paper [1] (Figure 2 in this article).

Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcomed.