Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Given an image, the proposed CNN-LSTM network generates an image caption. To capture multiple objects inside an image, features are extracted from a lower convolutional layer, unlike previous work which uses the final fully connected layer. Thus, a single image is represented by multiple feature vectors a at different locations s.
The LSTM is trained in a sequence-to-sequence manner: at each time step t, a feature a_t from location s_t is sampled and fed to the LSTM to generate a word. This process is repeated K times to generate a K-word image caption.
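To make the feature extraction concrete, here is a minimal PyTorch sketch of turning a convolutional feature map into per-location annotation vectors (the VGG-style backbone, the 224x224 input, and the resulting 14x14 grid are illustrative assumptions, not necessarily the paper's exact setup):

```python
import torch
import torchvision.models as models

# Keep only the convolutional stack of a CNN, dropping the fully connected head.
# Slicing off the final max-pool keeps a 14x14 spatial grid of 512-d features.
backbone = models.vgg16().features[:-1]   # in practice, load pretrained weights

image = torch.randn(1, 3, 224, 224)       # a dummy input image
feature_map = backbone(image)             # (1, 512, 14, 14)

# Flatten the spatial grid: each of the L = 14*14 = 196 locations s gets its own
# 512-dimensional annotation vector a.
B, D, H, W = feature_map.shape
a = feature_map.view(B, D, H * W).permute(0, 2, 1)   # (B, L, D) = (1, 196, 512)
```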
In this article, I will focus on the stochastic “Hard” attention, because the soft variant is trivial by comparison. I focus on the mathematical formulation of the paper because it is the interesting part; the results and computer vision contributions are omitted. In “hard” attention, the location s_t, with feature a_t, is sampled from a multinomial distribution parameterized by alpha. The parameter alpha is computed by a learned function f_att(a, h) of the image features a and the LSTM hidden state h.
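To pin down what f_att(a, h) might look like, here is a small sketch (the paper describes f_att as a multilayer perceptron conditioned on the hidden state; the specific layer sizes below are my assumptions):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """f_att(a, h): scores each of the L image locations given the LSTM state."""
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, a, h):
        # a: (B, L, feat_dim) annotation vectors, h: (B, hidden_dim) LSTM state
        e = self.score(torch.tanh(self.feat_proj(a) + self.hidden_proj(h).unsqueeze(1)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)   # (B, L), sums to 1 over locations
        return alpha
```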
This idea is intuitive: at each time step, the LSTM is fed a feature from a different image location to generate the corresponding word, as shown below.
Sampling from a multinomial is a trivial process. For example, in the figure below, at time t, location S_1 is more probable than S_3 and S_2, and so on.
To imagine how sampling is done at time t, think of the image features (a) as dots scattered throughout the whole image (the blue dots). Given alpha, at time t, some features are more likely to be sampled and fed to the LSTM than others; for example, the red features are more probable than the blue ones in the figure below.
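In code, this sampling step is a single call. A self-contained sketch, reusing the (B, L, D) shapes from above (the dummy alpha here stands in for the output of f_att):

```python
import torch

B, L, D = 2, 196, 512
a = torch.randn(B, L, D)                            # annotation vectors from the CNN
alpha = torch.softmax(torch.randn(B, L), dim=1)     # attention weights from f_att

# Hard attention: draw one location index per example from Multinomial(alpha),
# then pick out the corresponding annotation vector to feed to the LSTM.
s_t = torch.multinomial(alpha, num_samples=1)                               # (B, 1)
a_t = a.gather(1, s_t.unsqueeze(-1).expand(-1, -1, D)).squeeze(1)           # (B, D)
```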
Yet sampling within a neural network prevents end-to-end training: the sampling node is a random, non-differentiable node, so backpropagation through it is infeasible. The reparameterization trick is a typical workaround, in which a differentiable surrogate for the sampling operation is learned. The figure below explains this idea: instead of the random node z, we learn a differentiable function that produces z.
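For intuition, here is the standard Gaussian version of the trick (the one popularized by variational autoencoders). It only illustrates the general idea of replacing a random node with a differentiable function plus external noise; it is not what the paper does for the discrete location s_t, which is handled by the estimator described next:

```python
import torch

# Gaussian reparameterization: z ~ N(mu, sigma^2) is rewritten as a deterministic,
# differentiable function of (mu, sigma) plus parameter-free noise eps.
mu = torch.zeros(4, requires_grad=True)
log_sigma = torch.zeros(4, requires_grad=True)

eps = torch.randn(4)                    # randomness lives outside the graph
z = mu + torch.exp(log_sigma) * eps     # differentiable w.r.t. mu and log_sigma

z.sum().backward()                      # gradients flow back to mu and log_sigma
print(mu.grad, log_sigma.grad)
```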
In this paper, the new learnable function is L_s, a lower bound on the log-probability of the image caption y. It is a function of the features a and their locations s, and maximizing it maximizes the probability of the caption.
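Written out, the bound from the paper is (in the article's notation, with p(s|a) defined by alpha; the inequality is Jensen's):

```latex
L_s \;=\; \sum_{s} p(s \mid a)\,\log p(y \mid s, a)
    \;\le\; \log \sum_{s} p(s \mid a)\, p(y \mid s, a)
    \;=\; \log p(y \mid a)
```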
The gradient of this function with respect to the parameters W is shown in the next figure. Monte Carlo sampling is used to estimate this gradient by substituting sampled locations s_t for the intractable sum over s.
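For reference, that gradient and its Monte Carlo estimate, with N location samples s̃^n drawn from p(s|a), are:

```latex
\frac{\partial L_s}{\partial W}
  = \sum_{s} p(s \mid a)
    \left[ \frac{\partial \log p(y \mid s, a)}{\partial W}
         + \log p(y \mid s, a)\,\frac{\partial \log p(s \mid a)}{\partial W} \right]
\;\approx\;
  \frac{1}{N}\sum_{n=1}^{N}
    \left[ \frac{\partial \log p(y \mid \tilde{s}^{n}, a)}{\partial W}
         + \log p(y \mid \tilde{s}^{n}, a)\,\frac{\partial \log p(\tilde{s}^{n} \mid a)}{\partial W} \right]
```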
A moving average baseline is used to reduce the variance of this estimate, avoiding noisy (jumpy) gradients from batch to batch.
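A rough PyTorch sketch of how the estimator and baseline could be wired into a training step, as a surrogate loss whose gradient matches the estimate above (the function and variable names are mine; the 0.9 decay is the paper's choice for the moving average):

```python
def hard_attention_loss(log_p_y, log_p_s, baseline, decay=0.9):
    """Surrogate loss whose gradient matches the Monte Carlo estimate above.

    log_p_y:  (B,) log p(y | s~, a), caption likelihood given the sampled locations
    log_p_s:  (B,) log p(s~ | a), probability of the sampled locations under alpha
    baseline: running average of log_p_y from previous batches (a float)
    """
    # Update the moving-average baseline with the current batch (no gradient flows here).
    baseline = decay * baseline + (1 - decay) * log_p_y.detach().mean().item()

    reward = log_p_y.detach() - baseline          # centred "reward", reduces variance
    loss = -(log_p_y + reward * log_p_s).mean()   # backprop yields both gradient terms
    return loss, baseline
```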
The soft attention variant is basically a weighted summation of all features a, using alpha as the weights. That is why a good understanding of the hard variant makes the soft variant trivial.
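For completeness, the soft variant in the same notation is essentially a one-liner (again with dummy a and alpha standing in for the CNN features and f_att output):

```python
import torch

B, L, D = 2, 196, 512
a = torch.randn(B, L, D)                            # annotation vectors
alpha = torch.softmax(torch.randn(B, L), dim=1)     # attention weights from f_att

# Soft attention: instead of sampling one location, take the expectation, i.e. a
# weighted sum of all annotation vectors with alpha as the weights. Fully
# differentiable, so it trains with plain backpropagation.
context = (alpha.unsqueeze(-1) * a).sum(dim=1)      # (B, D)
```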