Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Approach Overview: In step (2), image features are extracted at lower convolutional layers. In step (3), a feature vector is sampled and fed to an LSTM to generate the corresponding word. Step (3) is repeated K times to generate a K-word caption.
Left: At time t, features a_t at locations s_t are scattered throughout the image. Right: Given alpha (the feature weights) at time t, the features shown in red are more likely to be sampled.
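The weighting described above can be sketched with the paper's soft (deterministic) attention variant: per-location scores are turned into weights alpha via a softmax, and the context fed to the LSTM is the alpha-weighted average of the annotation vectors. This is a minimal NumPy sketch; the function names and toy scores are illustrative, not the authors' code.

```python
import numpy as np

def attention_weights(scores):
    """Softmax over per-location relevance scores -> alpha (sums to 1)."""
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

def soft_attention_context(a, alpha):
    """Expected context vector z_t = sum_i alpha_i * a_i.

    a:     (L, D) annotation vectors from a lower conv layer (L locations).
    alpha: (L,)   attention weights at time t.
    """
    return alpha @ a

# Toy example: 4 spatial locations with 3-dim features.
a = np.arange(12, dtype=float).reshape(4, 3)
scores = np.array([0.1, 2.0, 0.3, -1.0])  # hypothetical relevance scores
alpha = attention_weights(scores)
z = soft_attention_context(a, alpha)       # context vector, shape (3,)
```

Locations with higher scores (here the second one) dominate the context vector, which is exactly the "red features are more likely to be sampled" behavior in the figure; in the stochastic ("hard") variant, a single location would instead be drawn from the multinomial defined by alpha.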
Figure from StackExchange.


I write reviews on computer vision papers. Writing tips are welcomed.

Ahmed Taha