Deep learning approaches typically train a single network on a large corpus of images. This paper instead proposes training one network per image to generate that image's prior. A prior is an underlying assumption we have about the world. For example, we assume a coin to be fair (50% heads and 50% tails); that's our prior. This prior is not always true, but most of the time, it is. Similarly, we assume natural images to be free of noise and holes. Thus, the paper proposes the deep image prior for denoising and inpainting applications. It argues against the belief that supervised learning is necessary for building good image priors, and shows that the generator network architecture by itself captures a great deal of image statistics.
The next figure illustrates the main idea. Given a noisy image x, a convolutional neural network (e.g., U-Net) is optimized using gradient descent to generate the prior of the noisy image, i.e., a denoised version x*. The input to the neural network is a fixed 3D tensor Z with 32 feature maps and the same spatial dimensions as x.
The weights of the network are optimized using vanilla gradient descent to minimize the following loss function

$$L = \lVert x^* - x \rVert^2$$
where x is the noisy image and x* is the generated denoised image, i.e., the prior. When training a neural network, we tend to seek a global minimum. For this loss function, the global minimum means regenerating the noisy image itself, i.e., L=0 when x* = x. This is expected given the neural network's huge overfitting capability. To avoid this global minimum, the paper terminates the optimization process early. It is argued that before reaching the global minimum, the generated image x* will either converge to a good-looking local optimum or, at least, the optimization trajectory will pass near one.
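The optimization loop described above can be sketched in a few lines. This is a toy stand-in, not the paper's setup: it assumes a tiny two-layer fully connected network written in NumPy instead of a U-Net, a 32-dimensional random vector `z` in place of the 3D tensor Z, and a hand-picked iteration budget `num_iters` as the stopping rule. The function name `fit_per_image_network` and all hyperparameters are hypothetical.

```python
import numpy as np

def fit_per_image_network(x_noisy, num_iters, lr=0.05, hidden=64, seed=0):
    """Fit one small network to one image, starting from a fixed random input."""
    rng = np.random.default_rng(seed)
    n = x_noisy.size
    target = x_noisy.ravel()

    # Fixed random input: sampled once and never updated (assumed 32-dim vector).
    z = rng.standard_normal(32)

    # The only trainable quantities are the network weights.
    W1 = rng.normal(0.0, 1.0 / np.sqrt(32), size=(hidden, 32))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(n, hidden))

    losses = []
    for _ in range(num_iters):  # early stopping = fixed iteration budget
        # Forward pass: x* = W2 tanh(W1 z)
        h = np.tanh(W1 @ z)
        out = W2 @ h
        err = out - target
        losses.append(float(np.mean(err ** 2)))

        # Backward pass for the mean squared error.
        g_out = 2.0 * err / n
        g_W2 = np.outer(g_out, h)
        g_pre = (W2.T @ g_out) * (1.0 - h ** 2)
        g_W1 = np.outer(g_pre, z)

        # Vanilla gradient descent step.
        W2 -= lr * g_W2
        W1 -= lr * g_W1

    x_star = (W2 @ np.tanh(W1 @ z)).reshape(x_noisy.shape)
    return x_star, losses

# Toy 8x8 "image": a horizontal gradient plus Gaussian noise.
rng = np.random.default_rng(1)
x_clean = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))
x_noisy = x_clean + rng.normal(0.0, 0.1, size=x_clean.shape)

x_star, losses = fit_per_image_network(x_noisy, num_iters=300)
print(f"loss at start: {losses[0]:.4f}, loss at stop: {losses[-1]:.4f}")
```

A fully connected toy like this has none of the convolutional structure the paper credits for the prior, so it only shows the mechanics: the loss against the noisy target keeps shrinking, which is why the number of iterations has to be capped.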
This argument raises a critical question: when should the optimization terminate, i.e., what is the termination criterion? This question is not answered in the paper. Early stopping is adopted, which is not a concrete solution. The next figure shows the impact of the number of optimization iterations on the generated image prior x*. It shows how a nice-looking local optimum is reached after 2400 iterations, before the network overfits the corrupted image. Fortunately, a follow-up paper tackles this termination limitation; I will elaborate on it at the end of the article.
Another challenge for this approach is computational complexity. According to the paper, it takes several minutes of GPU computation per image. That's why this approach is evaluated on a noisy image dataset containing only nine (9) images; this is a CVPR 2018 paper :). To be fair, there are more evaluations on the image inpainting application.
One of the core advantages of this idea is that no labels are required. It is a kind of unsupervised learning approach that is competitive with its supervised alternative. The next figure compares the image prior approach against a supervised network for image inpainting. The supervised Global-Local GAN is less computationally expensive, if we disregard the initial training cost, but the final results are comparable.
The authors reaffirm multiple times that the quality of the generated prior image depends on the network architecture. The next figure presents a qualitative evaluation of multiple architectures. It suggests that a deeper architecture is beneficial and that skip-connections are highly detrimental.
To tackle the aforementioned early stopping limitation and avoid fixing the number of optimization iterations, Cheng et al. propose a Bayesian approach. The big picture is to (1) enforce a prior on the network parameters using weight decay; (2) generate x* by integrating over the posterior, i.e., by taking a posterior mean.
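Written out, such a posterior mean might look like the expression below. The notation is an assumption made for this sketch and may differ from the follow-up paper's: f_θ(Z) denotes the network output for parameters θ and fixed input Z, p(θ | x) is the posterior over parameters given the noisy image x, and θ_1, ..., θ_T are T samples drawn from that posterior.

$$x^* = \int f_\theta(Z)\, p(\theta \mid x)\, d\theta \;\approx\; \frac{1}{T} \sum_{t=1}^{T} f_{\theta_t}(Z), \qquad \theta_t \sim p(\theta \mid x)$$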
In simple words, imagine you generate multiple image priors; these priors are not identical, and some are more probable than others. To aggregate these samples, a Bayesian approach averages them, weighted by their probabilities. Sampling multiple image priors is computationally expensive. Thus, stochastic gradient Langevin dynamics (SGLD) is employed instead of stochastic gradient descent (SGD), to avoid the cost of naive Monte Carlo (MC) sampling. SGLD provides a general framework for deriving an MCMC sampler by injecting Gaussian noise into the gradient updates. Basically, generate the prior using SGD, but inject Gaussian noise into the gradient while employing weight decay.
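The "gradient step plus Gaussian noise plus weight decay" recipe can be sketched on a problem small enough to check by hand. This is a minimal illustration under stated assumptions, not the follow-up paper's implementation: it uses the textbook Langevin update with full gradients on a one-parameter Gaussian model, where the weight decay term is the gradient of a Gaussian prior. The helper name `sgld_posterior_mean`, the step size, and the burn-in length are all hypothetical choices.

```python
import numpy as np

def sgld_posterior_mean(grad_neg_log_post, theta0, step, num_iters, burn_in, seed=0):
    """Average Langevin iterates after burn-in to estimate the posterior mean."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    running_sum = np.zeros_like(theta)
    kept = 0
    for t in range(num_iters):
        # Gradient step on the negative log posterior, plus injected noise
        # whose variance equals the step size (the Langevin update).
        noise = rng.normal(0.0, np.sqrt(step), size=theta.shape)
        theta = theta - 0.5 * step * grad_neg_log_post(theta) + noise
        if t >= burn_in:
            running_sum += theta
            kept += 1
    return running_sum / kept

# Toy model: y_i ~ N(theta, sigma^2) with prior theta ~ N(0, tau^2).
data_rng = np.random.default_rng(1)
sigma, tau = 1.0, 1.0
y = data_rng.normal(2.0, sigma, size=20)

def grad_neg_log_post(theta):
    data_term = np.sum(theta - y) / sigma ** 2  # pulls theta toward the data
    weight_decay = theta / tau ** 2             # gradient of the Gaussian prior
    return data_term + weight_decay

estimate = sgld_posterior_mean(grad_neg_log_post, theta0=[0.0],
                               step=0.01, num_iters=5500, burn_in=500)

# Closed-form posterior mean for this conjugate model, for comparison.
exact = (y.sum() / sigma ** 2) / (len(y) / sigma ** 2 + 1.0 / tau ** 2)
print(f"SGLD estimate: {estimate[0]:.3f}, closed form: {exact:.3f}")
```

The iterates are averaged with equal weights: because they are (approximately) drawn from the posterior, more probable values simply show up more often, which plays the role of the probability weighting described above.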
- The paper is well written and provides a different perspective on deep learning methods. The idea is simple, so it is an easy and pleasant paper to read.
- Implementations of both papers are released on GitHub.
- I wish the authors had elaborated more on computational complexity. “Taking several minutes of GPU computation per image” is vague wording. Does it take 3 or 20 minutes?
- From my perspective, the main limitation of this paper is not the computational complexity but the open question of when to terminate the optimization. I am glad the reviewers didn't reject the paper for this limitation. Fortunately, this issue is addressed in a follow-up paper.