Action Recognition Using Visual Attention

This paper employs soft attention for action recognition. Soft attention over image features was proposed in [1] for image caption generation; the related idea of hard attention is covered in a separate post.

The proposed idea is simple and straightforward. First, individual video frames are fed into GoogLeNet to generate image features (K x K x D). Each location in the K x K feature map corresponds to a different location in the original image; each vertical D-dimensional slice is the feature vector for one image location. Soft attention computes a weighted average of these features across locations.
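As a minimal sketch of this step (assuming PyTorch; the tensor names and the 7 x 7 x 1024 map size are illustrative, not taken from the paper), the feature map is flattened into K*K slices and combined with softmax-normalized weights:

```python
import torch

K, D = 7, 1024                       # assumed GoogLeNet-style feature map size
features = torch.randn(K * K, D)     # one D-dimensional slice per spatial location
scores = torch.randn(K * K)          # placeholder unnormalized attention scores

weights = torch.softmax(scores, dim=0)                    # attention weights sum to 1
attended = (weights.unsqueeze(1) * features).sum(dim=0)   # weighted average, shape (D,)
```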

The attention-weighted features (the slices X weighted by the attention weights l) are fed to a multi-layer LSTM for action prediction.
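A rough sketch of that step, under the same assumed sizes: the attended per-frame features go through a stacked LSTM whose hidden states feed a classifier. (In the actual model the attention and LSTM steps alternate per frame, since the weights depend on the previous hidden state; precomputed attended features are used here only to keep the sketch short.)

```python
import torch
import torch.nn as nn

D, H, num_layers, num_classes = 1024, 512, 3, 101   # illustrative sizes

lstm = nn.LSTM(input_size=D, hidden_size=H, num_layers=num_layers)
classifier = nn.Linear(H, num_classes)

x = torch.randn(30, 1, D)        # 30 attention-weighted frame features, batch of 1
outputs, _ = lstm(x)             # top-layer hidden state at every time step
logits = classifier(outputs)     # per-frame action scores y_t, shape (30, 1, num_classes)
```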

Unlike [1], where the attention weights are a function of both the current image features and the previous LSTM hidden state, this paper computes them from the hidden state only. It would be interesting to know why the current features are omitted.
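Concretely, the weights for frame t come from the previous hidden state alone; a sketch of this idea (the linear layer `attn` is a hypothetical stand-in for the paper's projection):

```python
import torch
import torch.nn as nn

K, H, D = 7, 512, 1024               # illustrative sizes
attn = nn.Linear(H, K * K)           # hypothetical layer: hidden state -> location scores

h_prev = torch.zeros(H)              # previous LSTM hidden state h_{t-1}
X_t = torch.randn(K * K, D)          # current frame's feature slices

l_t = torch.softmax(attn(h_prev), dim=0)        # weights depend only on h_{t-1}
x_t = (l_t.unsqueeze(1) * X_t).sum(dim=0)       # attended LSTM input for frame t
```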

To classify a video, the sequence of generated LSTM outputs (y) is averaged over time. The reported results are modest, yet the model's interpretability is an advantage that some applications require.
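The video-level prediction then reduces to averaging the per-frame scores, roughly as follows (sizes again illustrative):

```python
import torch

num_frames, num_classes = 30, 101                       # illustrative sizes
logits = torch.randn(num_frames, num_classes)           # per-frame LSTM outputs y_t
video_score = torch.softmax(logits, dim=1).mean(dim=0)  # average over the sequence
predicted_action = video_score.argmax().item()
```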

References

[1] K. Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," ICML 2015.

I write reviews on computer vision papers. Writing tips are welcome.
