The proposed idea is quite simple and straight forward. First, individual video frames are fed into GoogleNet to generate image features (K x K x D). Each location in the K x K feature map correspond to different location in the original image. Basically, each vertical slice is a feature for one image location. Soft Attention is a weighted average for these features from different locations.
The weighted features (l x X) are fed to a multi-layer LSTM for action prediction.
Unlike , where the weights are function of both the current feature and previous LSTM hidden state, this paper uses the hidden state only. It is interesting to know why the current features are omitted.
To classify a video, the sequence of generated LSTM outputs (y) are averaged. The results of this approach, shown below, are humble. Yet, it’s interpretability is an advantages that is required in some applications.