Action Recognition Using Visual Attention

This paper employs soft attention for action recognition. Soft attention over image features was proposed in [1] for image caption generation; the hard-attention variant, also introduced in [1], is not used here.

The proposed idea is simple and straightforward. First, individual video frames are fed into GoogLeNet to produce image features of shape K x K x D. Each location in the K x K feature map corresponds to a different region of the original image, so each vertical slice (a D-dimensional vector) is the feature for one image location. Soft attention computes a weighted average of these features across locations.
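As a minimal sketch of this weighted average (NumPy, with illustrative shapes; the paper takes the feature cube from GoogLeNet's last convolutional layer):

```python
import numpy as np

K, D = 7, 1024                       # assumed spatial size and feature depth
features = np.random.randn(K, K, D)  # one frame's K x K x D feature cube
X = features.reshape(K * K, D)       # K*K location features, one D-dim vector each

# attention weights over the K*K locations (must sum to 1); uniform here for illustration
l = np.full(K * K, 1.0 / (K * K))

# soft attention: weighted average of the location features
x_t = (l[:, None] * X).sum(axis=0)   # shape (D,), the frame representation fed to the LSTM
print(x_t.shape)
```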

The attention-weighted feature x_t = Σ_i l_{t,i} X_{t,i} (the weights l applied to the location features X) is fed to a multi-layer LSTM for action prediction.

Unlike [1], where the attention weights are a function of both the current image features and the previous LSTM hidden state, this paper computes them from the hidden state only. It would be interesting to know why the current features are omitted.

Weight computation comparison between [1] and this paper.
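Since the comparison figure is not reproduced here, the two formulations are roughly as follows (notation lightly simplified from the two papers):

```latex
\begin{align*}
  % This paper: attention weights depend on the previous LSTM hidden state only
  l_{t,i} &= \frac{\exp\left(W_i^{\top} h_{t-1}\right)}
                  {\sum_{j=1}^{K^2} \exp\left(W_j^{\top} h_{t-1}\right)} \\[4pt]
  % [1]: weights depend on both the location feature a_i and the hidden state
  e_{t,i} &= f_{\mathrm{att}}\left(a_i,\, h_{t-1}\right), \qquad
  \alpha_{t,i} = \frac{\exp\left(e_{t,i}\right)}{\sum_{k} \exp\left(e_{t,k}\right)}
\end{align*}
```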

To classify a video, the sequence of per-frame LSTM outputs (y) is averaged. The results of this approach, shown below, are modest. Yet its interpretability is an advantage that some applications require.
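A minimal PyTorch-style sketch of this classification step, assuming the attention-weighted frame features have already been computed (layer sizes are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

T, D, hidden, num_classes = 30, 1024, 512, 51   # assumed: frames, feature dim, LSTM size, classes

lstm = nn.LSTM(input_size=D, hidden_size=hidden, num_layers=3)  # multi-layer LSTM
classifier = nn.Linear(hidden, num_classes)

x = torch.randn(T, 1, D)             # sequence of attention-weighted frame features x_t
outputs, _ = lstm(x)                 # per-frame hidden states, shape (T, 1, hidden)
y = classifier(outputs).softmax(-1)  # per-frame class probabilities y_t

video_score = y.mean(dim=0)          # average over time -> video-level prediction
predicted_class = video_score.argmax(-1)
```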

Quantitative evaluation with baseline
Quantitative evaluation with state-of-the-art approaches
Where the network is looking

References

[1] K. Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015.

I write reviews of computer vision papers. Writing tips are welcome.
