A Real-time Action Prediction Framework by Encoding Temporal Evolution for Assembly Tasks

Ahmed Taha
Nov 23, 2017

This paper proposes a pipeline to predict future actions. Unlike previous works that predict the next RGB frame using a recurrent neural network, this paper predicts the next dynamic image. Future actions are then recognized by feeding a sequence of the predicted dynamic images to ResNet.

Dynamic images resemble optical flow in that both capture motion. The difference is the time span: optical flow captures motion between two consecutive frames, while a dynamic image summarizes motion across multiple frames in a single image, as shown in the figure.
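
To make the idea concrete, here is a minimal sketch of one common way to build a dynamic image: a weighted sum of the frames in a clip, where later frames get positive weights and earlier frames get negative ones (simplified approximate rank pooling). This summary does not say which variant the paper uses, so the function name `dynamic_image`, the linear weights and the rescaling to [0, 255] are all assumptions made for illustration.

```python
import numpy as np


def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a clip of shape (T, H, W, C) into one image of shape (H, W, C).

    Assumption: simplified approximate rank pooling with linear weights
    alpha_t = 2t - T - 1 for t = 1..T, so early frames are subtracted and
    late frames are added. Static regions cancel out; moving regions remain.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 4:
        raise ValueError("expected frames with shape (T, H, W, C)")
    num_frames = frames.shape[0]
    alpha = 2.0 * np.arange(1, num_frames + 1) - num_frames - 1
    # Weighted sum over the time axis.
    pooled = np.tensordot(alpha, frames, axes=(0, 0))
    # Rescale to [0, 255] so the result can be treated like a normal image.
    lo, hi = pooled.min(), pooled.max()
    if hi == lo:
        return np.zeros(pooled.shape, dtype=np.uint8)
    return ((pooled - lo) / (hi - lo) * 255.0).astype(np.uint8)
```

Because the weights sum to zero, a clip with no motion collapses to a flat image, while a pixel that lights up late in the clip ends up bright and one that lights up early ends up dark.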

Two networks are used in sequence to predict actions: PredNet, a predictive network, followed by ResNet-152. The suggested pipeline is summarized in the figure below.

First, dynamic images are generated from a sequence of raw RGB images. The generated dynamic images are fed to PredNet, a recurrent convolutional LSTM network, which predicts the future dynamic image. A sequence of predicted dynamic images is then fed to ResNet to recognize the future action.
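
The data flow above can be sketched as a small driver function. This is only a structural sketch, not the authors' code: `predict_future_action` and its parameters are hypothetical names, and the three callables stand in for the dynamic image generator, PredNet and ResNet respectively. The non-overlapping windows and the `horizon` parameter are also assumptions.

```python
from typing import Callable, List, Sequence

import numpy as np


def predict_future_action(
    frames: Sequence[np.ndarray],
    window: int,
    to_dynamic_image: Callable[[np.ndarray], np.ndarray],
    predict_next: Callable[[List[np.ndarray]], np.ndarray],
    classify: Callable[[List[np.ndarray]], int],
    horizon: int = 1,
) -> int:
    """Run the three stages: RGB frames -> dynamic images -> predictions -> label.

    All three callables are placeholders (hypothetical): to_dynamic_image plays
    the role of the dynamic image generator, predict_next the role of PredNet,
    and classify the role of ResNet.
    """
    clip = np.asarray(frames)
    if window <= 0 or len(clip) < window:
        raise ValueError("need at least one full window of frames")

    # Stage 1: one dynamic image per non-overlapping window of RGB frames.
    num_windows = len(clip) // window
    history = [
        to_dynamic_image(clip[i * window:(i + 1) * window])
        for i in range(num_windows)
    ]

    # Stage 2: roll the predictor forward, feeding predictions back in.
    predicted: List[np.ndarray] = []
    for _ in range(horizon):
        nxt = predict_next(history)
        predicted.append(nxt)
        history.append(nxt)

    # Stage 3: classify the sequence of predicted dynamic images.
    return classify(predicted)
```

Keeping the stages as injected callables makes it easy to swap any one of them, for example to compare the learned predictor against a trivial "repeat the last dynamic image" predictor.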


  • Dataset partitioning is usually around 60%, 10% and 30% for training, validation and testing. In this paper the split is unusual: 40 videos (about 90%) for training, 2 (about 5%) for validation and 2 (about 5%) for testing.
  • Background pixels cause a huge bias during training. The paper addresses this by adding white noise, which improved accuracy. I am not sure how the white-noise parameters were tuned, if at all. An alternative is to use the ratio of background to foreground pixels to assign different error weights to the two classes of pixels, which might improve accuracy. Randomly sampling background pixels and assigning them high error weights is another approach.
  • The baseline felt weak, especially for action prediction. The IKEA videos contain only five actions. A baseline predictor that simply repeats the action of the previous dynamic image achieves around 75% accuracy, which means actions switch relatively rarely in these videos. Future prediction accuracy measured across the whole video is therefore a weak metric. I wish accuracy had been normalized by the amount of action switching in each video. I am not sure about this point, but such normalization would have helped the evaluation.
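
Two of the suggestions in the notes above can be sketched in a few lines. This is my own illustration, not something from the paper: `weighted_pixel_error` shows a per-pixel error with separate foreground and background weights (the mask and the weight values are assumptions), and `persistence_baseline` plus `accuracy_report` show one possible reading of "normalizing by action switching", namely reporting accuracy only at the steps where the action actually changes.

```python
import numpy as np


def weighted_pixel_error(pred, target, foreground_mask,
                         fg_weight=1.0, bg_weight=0.1):
    """Weighted mean absolute error.

    foreground_mask must have the same shape as pred and target.
    The default weights are arbitrary placeholders, not tuned values.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(foreground_mask, dtype=bool)
    weights = np.where(mask, fg_weight, bg_weight)
    return float((weights * np.abs(pred - target)).sum() / weights.sum())


def persistence_baseline(labels):
    """Predict each step's action as a copy of the previous step's action.

    Returns predictions for steps 1..T-1 (step 0 has no previous step).
    """
    return np.asarray(labels)[:-1]


def accuracy_report(labels, predictions):
    """Compare predictions for steps 1..T-1 against the true labels.

    Reports overall accuracy, the fraction of steps where the action
    switches, and accuracy restricted to those switching steps.
    """
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    truth = labels[1:]
    correct = predictions == truth
    switches = truth != labels[:-1]
    at_switches = float(correct[switches].mean()) if switches.any() else float("nan")
    return {
        "overall": float(correct.mean()),
        "switch_rate": float(switches.mean()),
        "at_switches": at_switches,
    }
```

For a label sequence with long runs of the same action, the persistence baseline scores high overall but zero at the switching steps, which is exactly the gap that a whole-video accuracy number hides.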


Overall, this paper presents a pipeline for future action prediction and suggests that dynamic images are a useful way to encode motion for downstream action recognition.