Distilling the Knowledge in a Neural Network

Ahmed Taha
4 min read · Sep 24, 2019

Deep neural networks achieve impressive performance but are computationally expensive during both training and testing. Running a trained network on a mobile device for inference is desirable because it reduces server/network traffic and improves system scalability. This motivates the development of compact networks like MobileNet. Unfortunately, these compact networks lag significantly behind very deep architectures. To reduce this performance gap, research has focused on two main directions: network compression and network distillation.

Network compression research assumes the network's knowledge is encapsulated in its layers and weights. Thus, compression approaches keep a similar network architecture but prune some of its weights, such as convolutional filters/kernels. Network distillation, on the other hand, assumes the network's activations express its knowledge. These approaches train a compact distilled network to mimic the activation behavior of a cumbersome network. This cumbersome network, also known as the teacher network, can be a single bulky network or an ensemble of networks, as shown in the next figure.

Caruana et al. [2] used the cumbersome network's logits (before the softmax) as soft labels to train the distilled model. The softmax's output probabilities are dismissed because they are sparse, i.e., most probabilities are close to zero, which suppresses any cross-class similarity knowledge. The knowledge distillation paper [1] instead proposes a new softmax variant by introducing a temperature parameter as follows

q_i = exp(z_i / T) / Σ_j exp(z_j / T)

Softmax with temperature parameter T for knowledge distillation

where z is the logit and q is the class probability. This function reverts to the vanilla softmax at T=1, but its behavior converges to plain logit matching when: (1) the temperature T is high compared with the magnitude of the logits, and (2) the logits are zero-meaned. This is mathematically supported in the next figure.
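
Before that, here is a quick NumPy sanity check of this softmax variant (a minimal sketch; the function name and toy logit values are mine, not from the paper):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Soft targets: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, -3.0])             # toy logits
print(softmax_with_temperature(logits, T=1))    # ~[0.998, 0.002, 0.000] -> vanilla softmax, near one-hot
print(softmax_with_temperature(logits, T=20))   # ~[0.43, 0.32, 0.25]  -> much softer, roughly linear in the (zero-meaned) logits
```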

With a high temperature T, distillation using the softmax probabilities is equivalent to minimizing the distance between the logits (z, v). Notice that the gradients dC/dz and dL/dz are equivalent; the missing constant 1/(NT²) is just a scaling factor.
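
For completeness, the argument behind that figure (following the paper's notation, with p and q the teacher's and student's soft probabilities, v and z their logits, and N the number of classes) runs roughly as follows:

dC/dz_i = (1/T) (q_i − p_i) = (1/T) ( e^(z_i/T) / Σ_j e^(z_j/T) − e^(v_i/T) / Σ_j e^(v_j/T) )

When T is large compared with the logits, e^(x/T) ≈ 1 + x/T, so

dC/dz_i ≈ (1/T) ( (1 + z_i/T) / (N + Σ_j z_j/T) − (1 + v_i/T) / (N + Σ_j v_j/T) )

and if, in addition, the logits are zero-meaned (Σ_j z_j = Σ_j v_j = 0), this reduces to

dC/dz_i ≈ (1/(NT²)) (z_i − v_i)

which is, up to the 1/(NT²) scaling factor, the gradient of the squared logit-matching loss L = ½ Σ_i (z_i − v_i)².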

The high/low-temperature tradeoff:

Using the vanilla logits, or equivalently the distillation softmax at a high temperature, has pros and cons:

  • High temperature pros: Using logits, or a high temperature, directly retains latent knowledge about the similarity (+ve correlation) between different classes. For example, in the MNIST dataset a ‘2’ looks more similar to an ‘8’ than to a ‘1’. Thus, the logits for ‘2’ and ‘8’ are more highly correlated than the logits for ‘2’ and ‘1’; the ‘2’ and ‘8’ logits are expected to rise together. This knowledge is beneficial, and we want to transfer it (see the small numeric sketch after this list).
  • High temperature cons come from logits with highly negative values. These values could be either the result of noisy fluctuations or a clear distinction between two classes (-ve correlation). Whether this kind of knowledge should be transferred or not is left as an empirical question; it is not answered in the paper.
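
Both points can be seen on toy numbers. The sketch below uses hypothetical teacher logits for an image of a ‘2’ (the values are made up for illustration only):

```python
import numpy as np

def soften(logits, T):
    e = np.exp((logits - logits.max()) / T)
    return e / e.sum()

# Hypothetical teacher logits for an image of a '2' (classes 0-9):
# '8' gets a fairly large logit (visually similar), '1' a very negative one.
logits = np.array([-1.0, -8.0, 9.0, 1.0, -2.0, 0.5, -1.5, 0.0, 4.0, -2.0])

np.set_printoptions(precision=3, suppress=True)
print(soften(logits, T=1))   # '2' gets ~0.99; everything else is squashed toward zero
print(soften(logits, T=20))  # probabilities spread out: '8' (~0.12) sits clearly above '1' (~0.07),
                             # and the very negative '1' logit now visibly shapes the soft target
```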

This means the temperature parameter provides fine control over the effect of highly negative logits. With a high temperature, highly negative logits impact the knowledge distillation loss, while a moderate temperature limits their impact. The paper suggests using an intermediate temperature when the distilled model is too small to capture the whole cumbersome model’s knowledge, which strongly suggests that ignoring the large negative logits can be helpful.
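
To make the training recipe concrete, below is a minimal PyTorch-style sketch of the distillation objective described in the paper: a weighted average of a soft-target term (computed at temperature T and multiplied by T² so its gradients stay on the same scale as the hard-target term) and the ordinary cross-entropy with the true labels. The function name, the KL formulation (which has the same gradients as the soft cross-entropy), and the α=0.5 weighting are my own choices for illustration:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=20.0, alpha=0.5):
    """Weighted average of soft-target and hard-target objectives (sketch)."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)        # teacher's softened probabilities
    log_q = F.log_softmax(student_logits / T, dim=1)           # student's softened log-probabilities
    soft_loss = F.kl_div(log_q, soft_targets, reduction="batchmean") * (T ** 2)
    hard_loss = F.cross_entropy(student_logits, labels)        # usual loss on the true labels
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Usage with hypothetical shapes: a batch of 32 examples, 10 classes.
student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```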

The next figure summarizes a quantitative evaluation on the MNIST dataset using three models: cumbersome, small, and distilled. The distilled model has the same architecture as the small model but employs the cumbersome model’s soft labels during training. The fourth row illustrates how the number of test errors decreases from 146 to 74 when distilling the cumbersome model at temperature T=20.

The rest of the paper provides experiments on speech recognition and Google’s internal image dataset (JFT). I am not familiar with either, so they are omitted from this article.

My comments

1- The paper proposes a simple idea. I found the introduction, the proposed formulation, and the preliminary results sections well written and presented in a very interesting way. This style reminds me of how papers should be written, in a storytelling format.

2- The last few sections, about speech recognition and the JFT dataset, are irrelevant to me. I have no background in speech recognition or Google’s internal JFT dataset. I think a lot of people will find it hard to relate to the JFT experiments and their technical details.

3- This paper was published in 2015. It is a good starting point, along with [2], to get familiar with the basics of network distillation. Yet a lot of further reading is required to keep up with the latest approaches to this problem.

Resources

[1] Hinton, Vinyals, and Dean. Distilling the Knowledge in a Neural Network. 2015.

[2] Buciluǎ, Caruana, and Niculescu-Mizil. Model Compression. 2006.
