Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Approach Overview: In step (2), image features are extracted from a lower convolutional layer. In step (3), a feature is sampled and fed to the LSTM to generate the corresponding word. Step (3) is repeated K times to generate a K-word caption.
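
The generation loop in step (3) can be sketched in a few lines of PyTorch. This is a minimal soft-attention sketch, not the paper's code: the module name AttentionDecoder and the feature, hidden, and vocabulary sizes are placeholders, and the full model also feeds the embedding of the previously generated word into the LSTM at each step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoder(nn.Module):
    """Minimal sketch: attend over L annotation vectors a_i taken from a
    lower conv layer, then emit one word per step for K steps.
    Sizes are illustrative assumptions, not from the paper."""
    def __init__(self, feat_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.att = nn.Linear(feat_dim + hidden_dim, 1)  # scores e_{t,i}
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, K=16):
        # feats: (B, L, D) -- L spatial locations from a lower conv layer
        B, L, D = feats.shape
        h = feats.new_zeros(B, self.lstm.hidden_size)
        c = feats.new_zeros(B, self.lstm.hidden_size)
        words = []
        for _ in range(K):                                # step (3), K times
            # alpha_{t,i}: softmax over locations of score(a_i, h_{t-1})
            hx = h.unsqueeze(1).expand(B, L, -1)
            scores = self.att(torch.cat([feats, hx], dim=-1))
            alpha = F.softmax(scores.squeeze(-1), dim=1)  # (B, L)
            # soft attention: context z_t = sum_i alpha_{t,i} * a_i
            z = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, D)
            h, c = self.lstm(z, (h, c))
            words.append(self.out(h).argmax(dim=-1))      # greedy word choice
        return torch.stack(words, dim=1)                  # (B, K) word ids
```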
Left: at time t, features a_t at locations s_t are scattered throughout the image. Right: given alpha (the feature weights) at time t, the features highlighted in red are more likely to be sampled.
From stackexchange
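
The figure's point, that features with larger alpha are more likely to be selected, is the paper's hard vs. soft attention distinction: hard attention samples one location from the alpha distribution, while soft attention takes the alpha-weighted average of all features. A toy sketch with made-up weights:

```python
import torch

# Hypothetical attention weights alpha over L=6 locations at time t;
# a larger alpha_i means feature a_i is more likely to be sampled.
alpha = torch.tensor([0.05, 0.10, 0.40, 0.30, 0.10, 0.05])
feats = torch.randn(6, 512)          # stand-in annotation vectors a_i

# Hard attention: draw a location s_t ~ Multinomial(alpha), keep a_{s_t}.
s_t = torch.multinomial(alpha, num_samples=1)
z_hard = feats[s_t.item()]           # sampled feature (stochastic)

# Soft attention: take the expectation z_t = sum_i alpha_i * a_i instead.
z_soft = alpha @ feats               # weighted average (deterministic)
```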
