Deep learning approaches typically train a single network on a large corpus of images. This paper instead proposes training one network per image to generate that image’s prior. A prior is an underlying assumption we hold about the world. For example, we assume a coin is fair (50% heads and 50% tails); that’s our prior. This prior is not always true, but most of the time it is. Similarly, we assume natural images are free of noise and holes. Building on this, the paper proposes a deep image prior for denoising and inpainting applications. It argues against the belief that supervised learning is necessary for building good image priors, and shows that the generator network’s architecture alone captures a great deal of image statistics.

The next figure illustrates the main idea. Given a noisy image x, a convolutional neural network (e.g., U-Net) is optimized using gradient descent to generate the prior of the noisy image: a denoised version x*. The input to the neural network is a fixed 3D tensor Z, which has 32 feature maps and the same spatial dimensions as x.

Deep Image Prior employs a U-Net architecture to denoise and inpaint images. The U-Net’s weights are optimized using gradient descent to generate the prior image x*.
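To make the shape of the fixed input concrete, here is a minimal sketch in plain Python. The function name `make_fixed_input` is hypothetical, and filling Z with small uniform random values is an assumption on my part; the text above only states that Z is a fixed 3D tensor with 32 feature maps and the spatial size of x.

```python
import random

def make_fixed_input(height, width, channels=32, seed=0):
    """Build a channels x height x width tensor that never changes.

    Assumption: entries are small uniform random values. The seed makes
    the tensor reproducible, so the same Z is fed at every iteration.
    """
    rng = random.Random(seed)
    return [[[rng.uniform(0.0, 0.1) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]

# A tiny 4x6 "image" keeps the example readable.
z = make_fixed_input(height=4, width=6)
print(len(z), len(z[0]), len(z[0][0]))
```

Only the network weights are updated during optimization; Z is created once and reused.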

The weights of the network are optimized using vanilla gradient descent to minimize the following loss function

The loss function for image prior generation.

where x is the noisy image and x* is the generated denoised image, i.e., the prior. When training a neural network, we tend to seek a global minimum. For this loss function, the global minimum means regenerating the noisy image itself, i.e., L=0 when x* = x. This is expected given the neural network’s huge capacity to overfit. To avoid this global minimum, the paper terminates the optimization process early. It is argued that before reaching the global minimum, the generated image x* will either converge to a good-looking local optimum or, at least, the optimization trajectory will pass near one.
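For reference, the loss described above can be written out as follows. This is a sketch under two assumptions: the symbol f_θ for the network with weights θ is my notation, and the squared L2 distance is the data term I take the paper to use for denoising.

```latex
L(\theta) = \lVert x^{*} - x \rVert^{2},
\qquad x^{*} = f_{\theta}(Z)
```

Only θ is optimized; Z stays fixed. L reaches 0 exactly when x* = x, which is the overfitting solution that early termination is meant to avoid.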

This argument raises a critical question: when should the optimization terminate, i.e., what is the termination criterion? The paper does not answer this question. Early stopping is adopted, but it is not a concrete solution because the number of iterations still has to be fixed by hand. The next figure shows the impact of the number of optimization iterations on the generated image prior x*. It shows how a nice-looking local optimum is reached after 2400 iterations, before the network overfits the corrupted image. Fortunately, a follow-up paper [2] tackles the termination criterion limitation; I will elaborate on it at the end of the article.

Our approach can restore an image with a complex degradation (JPEG compression in this case). As the optimization process progresses, the deep image prior allows to recover most of the signal while getting rid of halos and blockiness (after 2400 iterations) before eventually overfitting to the input (at 50K iterations).
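The early-stopping behavior can be explored on a toy 1D problem. The sketch below is not the paper's setup: a fixed smoothing filter applied to `theta * z` stands in for the U-Net, and all names (`blur`, `generate`, `run`) are hypothetical. The clean signal is known here only because the data is synthetic; that is exactly what is missing in practice and why the termination criterion is hard to choose.

```python
import math
import random

def blur(v):
    """Circular 3-tap smoothing filter, a stand-in for a conv layer."""
    n = len(v)
    return [0.25 * v[i - 1] + 0.5 * v[i] + 0.25 * v[(i + 1) % n]
            for i in range(n)]

def generate(theta, z):
    """Toy generator: scale the fixed input z by the weights, then smooth."""
    return blur([t * zi for t, zi in zip(theta, z)])

def sq_err(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def run(noisy, clean, z, steps, lr=0.1):
    """Vanilla gradient descent on theta; z is never updated."""
    theta = [0.0] * len(noisy)
    fit_loss, true_err = [], []
    for _ in range(steps):
        x_star = generate(theta, z)
        fit_loss.append(sq_err(x_star, noisy))   # what we optimize
        true_err.append(sq_err(x_star, clean))   # what we actually care about
        # Backpropagate through the blur (symmetric, so it is its own transpose).
        back = blur([2.0 * (p - q) for p, q in zip(x_star, noisy)])
        theta = [t - lr * g * zi for t, g, zi in zip(theta, back, z)]
    return fit_loss, true_err

rng = random.Random(0)
n = 64
clean = [math.sin(2 * math.pi * i / n) for i in range(n)]
noisy = [c + rng.gauss(0.0, 0.3) for c in clean]
z = [rng.uniform(0.5, 1.5) for _ in range(n)]

fit_loss, true_err = run(noisy, clean, z, steps=2000)
best = min(range(len(true_err)), key=true_err.__getitem__)
print("best iteration:", best)
print("error vs clean at best / at end:", true_err[best], true_err[-1])
```

The loss against the noisy signal keeps decreasing, while one would expect the error against the clean signal to bottom out earlier, because the smooth component is fitted faster than the noise. Picking `best` requires the clean signal, so in a real setting the stopping iteration remains a hand-tuned hyper-parameter.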

Another challenge for this approach is computational complexity. According to the paper, it takes several minutes of GPU computation per image. That’s why this approach is evaluated using a noisy image dataset containing only nine (9) images [3], and this is a CVPR 2018 paper :). To be fair, there are more evaluations on the image inpainting application.

One of the core advantages of this idea is that no labels are required. It is a kind of unsupervised learning approach that can rival its supervised alternative. The next figure compares the image prior approach against a supervised network for image inpainting. The supervised Global-Local GAN is less computationally expensive, if we disregard the initial training cost, but the final results are comparable.

In many cases, deep image prior is sufficient to successfully inpaint large regions. Despite using no learning, the results may be comparable to [15] which does. The choice of hyper-parameters is important (for example (d) demonstrates sensitivity to the learning rate), but a good setting works well for most images we tried.

The authors reaffirm multiple times that the quality of the generated prior image depends on the network architecture. The next figure presents a qualitative evaluation of multiple architectures. It suggests that, for inpainting, a deeper architecture is beneficial and that adding skip connections is highly detrimental.

Inpainting using different depths and architectures. The figure shows that much better inpainting results can be obtained by using deeper random networks. However, adding skip connections to ResNet in U-Net is highly detrimental.

To tackle the aforementioned early stopping limitation and avoid fixing the number of optimization iterations, Cheng et al. [2] propose a Bayesian approach. The big picture is to (1) enforce a prior on the network parameters using weight decay; (2) generate x* by integrating the posterior as follows

Integrating the posterior (weighted average) to generate the final result x* without early stopping
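One plausible way to write this posterior integration, using the symbols already introduced in this article, is given below. The notation f_θ for the network, p(θ | x) for the posterior over weights, and θ_t for the t-th sample are my assumptions and may differ from the notation in [2].

```latex
x^{*} = \int f_{\theta}(Z)\, p(\theta \mid x)\, d\theta
\;\approx\; \frac{1}{T} \sum_{t=1}^{T} f_{\theta_t}(Z),
\qquad \theta_t \sim p(\theta \mid x)
```

The sum is the sample-based approximation of the integral: when the θ_t are drawn from the posterior, likely weights appear more often, so a plain average already acts as a probability-weighted average.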

In simple words, imagine you generate multiple image priors; these priors are not identical, and some are more probable than others. To aggregate these samples, a Bayesian approach integrates them as a weighted average using their probabilities. Sampling multiple image priors naively is computationally expensive. Thus, stochastic gradient Langevin dynamics (SGLD) is employed instead of stochastic gradient descent (SGD), to avoid naive Monte Carlo (MC) sampling. SGLD provides a general framework to derive an MCMC sampler by injecting Gaussian noise into the gradient updates. Basically, generate the prior using SGD, but inject Gaussian noise into the gradient while employing weight decay.
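The recipe above (gradient step, weight decay, injected Gaussian noise, then averaging the samples) can be sketched on a toy problem. This is not the implementation from [2]: the "generator" here is the identity (x* equals the parameters), chosen only because the posterior mean is then known in closed form and the sampler can be checked against it. The function name `sgld_posterior_mean` and all default values are hypothetical.

```python
import random

def sgld_posterior_mean(x, sigma=0.5, weight_decay=1.0, step=0.01,
                        burn_in=1000, n_samples=20000, seed=0):
    """Average SGLD samples for a Gaussian likelihood with a Gaussian prior.

    Update rule (standard SGLD):
        theta <- theta - (step / 2) * grad + N(0, step)
    where grad is the gradient of the negative log-posterior.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(x)
    running_sum = [0.0] * len(x)
    for it in range(burn_in + n_samples):
        new_theta = []
        for t, xi in zip(theta, x):
            # Data term plus weight decay (the Gaussian prior on theta).
            grad = (t - xi) / sigma ** 2 + weight_decay * t
            noise = rng.gauss(0.0, step ** 0.5)  # variance equals the step size
            new_theta.append(t - 0.5 * step * grad + noise)
        theta = new_theta
        if it >= burn_in:
            # Identity generator (toy assumption): the sample x* is theta itself.
            running_sum = [s + t for s, t in zip(running_sum, theta)]
    return [s / n_samples for s in running_sum]

x = [1.0, -0.5, 2.0, 0.0]
estimate = sgld_posterior_mean(x)

# Closed-form posterior mean for this toy model, used only as a sanity check.
precision_data = 1.0 / 0.5 ** 2
exact = [xi * precision_data / (precision_data + 1.0) for xi in x]
print(estimate)
print(exact)
```

Note that no early stopping is involved: the chain keeps running and the result is the average of the samples collected after burn-in. With a real network, the identity generator would be replaced by the network output for the fixed input Z.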

My Comments

Resources

[1] Deep Image Prior

[2] A Bayesian Perspective on the Deep Image Prior

[3] http://www.cs.tut.fi/~foi/GCF-BM3D/index.html#ref_results

I write reviews on computer vision papers. Writing tips are welcomed.