Distilling the Knowledge in a Neural Network

Deep neural networks achieve impressive performance but are computationally expensive during both training and testing. Running inference with a trained network directly on a mobile device is desirable because it reduces server and network traffic and improves system scalability. This motivates the development of compact networks like MobileNet. Unfortunately, these compact networks lag significantly behind very deep architectures in accuracy. To reduce this performance gap, research has focused on two main directions: network compression and network distillation.
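To make the distillation direction concrete, below is a minimal sketch of the distillation loss from Hinton et al.'s paper: the student is trained against the teacher's temperature-softened output distribution in addition to the ground-truth labels. The sketch assumes PyTorch, and the hyperparameter values `T` and `alpha` are illustrative, not prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target KL loss (temperature T) blended with hard-label cross-entropy.

    T and alpha are illustrative hyperparameters; typical values are tuned per task.
    """
    # Soft targets: compare the student's and teacher's distributions,
    # both softened by dividing the logits by the temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard-target scale

    # Hard targets: standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example usage with random logits for a 10-class problem:
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

A higher temperature flattens the teacher's distribution, exposing the relative probabilities it assigns to incorrect classes; this "dark knowledge" is the extra training signal that lets a compact student recover some of the teacher's accuracy.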