Log p(y|a) is the main objective: generate the most probable caption/word y given an image a.

To introduce attention, rewrite log p(y|a) as log sum[p(y|s,a) p(s|a)], where the sum runs over the attention locations s. This is the law of total probability [wiki].

So far, no lower bound or variational element has been introduced.

To simplify log sum[p(y|s,a) p(s|a)], it is *approximated* by sum[p(s|a) log(p(y|s,a))].

This comes from Jensen’s inequality: the log function is concave, so log(t x_1 + (1-t) x_2) >= t log(x_1) + (1-t) log(x_2). The p(s|a) terms are probabilities that sum to 1, so they play the role of the weights t and (1-t).

Thus, sum[p(s|a) log(p(y|s,a))] <= log sum[p(y|s,a) p(s|a)]
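The inequality is easy to check numerically. Below is a minimal sketch with made-up numbers (the three values of p(y|s,a) are hypothetical, chosen only for illustration); it computes both sides for a single caption y.

```python
import numpy as np

# Hypothetical toy numbers: three attention locations s = 0, 1, 2.
p_s_given_a = np.array([0.45, 0.40, 0.15])    # p(s|a), sums to 1
p_y_given_sa = np.array([0.30, 0.10, 0.05])   # p(y|s,a) for one fixed caption y

# Law of total probability: p(y|a) = sum_s p(y|s,a) p(s|a)
log_p_y_given_a = np.log(np.sum(p_y_given_sa * p_s_given_a))

# Jensen lower bound: L_s = sum_s p(s|a) log p(y|s,a)
L_s = np.sum(p_s_given_a * np.log(p_y_given_sa))

print(f"log p(y|a) = {log_p_y_given_a:.4f}")
print(f"L_s        = {L_s:.4f}")
print("bound holds:", L_s <= log_p_y_given_a)
```

The two sides are equal only when p(y|s,a) is the same for every s that has nonzero probability; otherwise L_s is strictly smaller.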

Now, L_s = sum[p(s|a) log(p(y|s,a))]

Now we have a lower bound, but not the variational aspect yet.

Differentiating L_s, to enable backpropagation, gives eq. 11.
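As a sketch of how that derivative works out (W stands for the model parameters, a symbol assumed here for illustration): apply the product rule to each term of L_s, then use the identity dp/dW = p * dlog(p)/dW to pull p(s|a) out front.

```latex
\begin{aligned}
\frac{\partial L_s}{\partial W}
  &= \sum_s \left[ \frac{\partial p(s \mid a)}{\partial W} \log p(y \mid s, a)
     + p(s \mid a) \frac{\partial \log p(y \mid s, a)}{\partial W} \right] \\
  &= \sum_s p(s \mid a) \left[ \frac{\partial \log p(y \mid s, a)}{\partial W}
     + \log p(y \mid s, a) \frac{\partial \log p(s \mid a)}{\partial W} \right]
\end{aligned}
```

Written this way, the gradient is an expectation under p(s|a), which is what makes a sampling-based approximation possible.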

To reach eq. 12:

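First, the sum weighted by p(s|a) is not computed exactly. It is replaced by an average over N samples, each weighted by 1/N (notice the summation is now over N samples, and the weights 1/N still sum to 1, just as Sum(p(s|a)) = 1).

Second, each sample is obtained by *sampling* an image part a_i from a Multinoulli distribution. This is where the variational aspect comes from.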
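The two steps together amount to a Monte Carlo estimate. The sketch below shows the idea on L_s itself rather than on its gradient, using the same kind of made-up numbers (the p(y|s,a) values and N are hypothetical): drawing locations from p(s|a) and averaging with weight 1/N approaches the exact weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical toy numbers for three attention locations.
p_s_given_a = np.array([0.45, 0.40, 0.15])
log_p_y_given_sa = np.log(np.array([0.30, 0.10, 0.05]))

# Exact value: sum over all locations, weighted by p(s|a).
exact = np.sum(p_s_given_a * log_p_y_given_sa)

# Monte Carlo value: draw N locations from p(s|a), weight each draw by 1/N.
N = 100_000
samples = rng.choice(len(p_s_given_a), size=N, p=p_s_given_a)
estimate = log_p_y_given_sa[samples].mean()

print(f"exact    = {exact:.4f}")
print(f"estimate = {estimate:.4f}")
```

With only three locations the exact sum is cheap, so this is purely illustrative; the sampled version matters when the model commits to one location per step during training.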

For instance, an image has three parts a_1, a_2, a_3, which are assigned probabilities [0.45, 0.4, 0.15].

While a_1 has the highest probability, it is not guaranteed to be sampled. a_2 has only a slightly lower probability than a_1, so it also has a good chance of being picked.
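Here is a small sketch of that example, assuming NumPy's random generator as the sampler (the seed and the number of draws are arbitrary choices). It draws one-hot samples from the distribution and compares the observed frequencies with a greedy argmax pick.

```python
import numpy as np

rng = np.random.default_rng(42)
probs = np.array([0.45, 0.40, 0.15])  # probabilities of a_1, a_2, a_3

# Each row is a one-hot draw from the Multinoulli (categorical) distribution.
draws = rng.multinomial(1, probs, size=10_000)  # shape (10000, 3)
frequencies = draws.mean(axis=0)

# Greedy alternative: always pick the most probable part.
greedy_index = int(np.argmax(probs))

print("sampled frequencies:", frequencies)
print("greedy pick: a_%d" % (greedy_index + 1))
```

The frequencies land close to the assigned probabilities, so a_2 is chosen almost as often as a_1, whereas the greedy rule would never pick it.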

This sampling procedure is non-deterministic; that is where the variational aspect is introduced. Note that because s is discrete, the gradient through the sampling step is estimated with a score-function (REINFORCE-style) rule rather than with the reparameterization trick.

Why don’t we just pick a_1, since it has the highest probability?

* It is an advantage to train a network that generates slightly different captions for the same image: not totally different, but not identical either.

This way the network behaves like a human, who can write multiple similar captions for the same image.

So, the network should be able to generate something like:

* A man playing with a green ball

* The man kicks a green ball

I hope this is helpful.