A Real time Action Prediction Framework by Encoding Temporal Evolution for Assembly Tasks

This paper proposes a pipeline for predicting future actions. Unlike previous works that predict the next RGB frame using a recurrent neural network, this paper predicts the next dynamic image. Future actions are then recognized by feeding a sequence of the predicted dynamic images to a ResNet.

Dynamic images are similar to optical flow in that both encode motion. The difference is the time span: optical flow captures motion between two consecutive frames, whereas a dynamic image summarizes motion across multiple frames, as shown in the figure.
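To make the idea concrete, here is a minimal sketch of how a dynamic image could be computed. It assumes the simplified approximate rank pooling weights from the dynamic image literature, alpha_t = 2t - T - 1; the review does not state which variant the paper uses, so the function name `dynamic_image` and the weighting are assumptions for illustration.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a clip of shape (T, H, W, C) into one (H, W, C) image.

    Assumption: simplified approximate rank pooling, where frame t
    (1-indexed) gets weight alpha_t = 2t - T - 1. Later frames get
    positive weights and earlier frames negative ones, so the result
    highlights how pixels change over the whole clip.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    alpha = 2.0 * np.arange(1, T + 1) - T - 1
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=(0, 0))

# A static clip has no motion; the weights sum to zero, so the
# dynamic image is all zeros.
static_clip = np.ones((4, 2, 2, 3))
static_result = dynamic_image(static_clip)

# A clip that brightens over time gives a positive response.
brightening = np.stack([np.full((2, 2, 3), float(t)) for t in range(4)])
brightening_result = dynamic_image(brightening)
```

Because the weights sum to zero, anything that stays constant across the clip cancels out, which is why a dynamic image looks like a motion summary rather than an average frame.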

A prediction network (PredNet) and ResNet152 are used in sequence to predict actions. The proposed pipeline is summarized in the figure below.

First, dynamic images are generated from a sequence of raw RGB images. The generated dynamic images are fed to PredNet, a recurrent LSTM-based CNN, which predicts the future dynamic image. A sequence of predicted dynamic images is then fed to ResNet to recognize the future action.
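The data flow of these steps can be sketched as follows. This is only an outline under stated assumptions: `predictor` and `classifier` are hypothetical callables standing in for PredNet and ResNet, the windows are non-overlapping, and the predictions are rolled out one step at a time. None of these details are confirmed by the paper.

```python
import numpy as np

def dynamic_image(frames):
    # Assumed weighting alpha_t = 2t - T - 1 (simplified rank pooling).
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    alpha = 2.0 * np.arange(1, T + 1) - T - 1
    return np.tensordot(alpha, frames, axes=(0, 0))

def predict_future_action(frames, window, horizon, predictor, classifier):
    """Sketch of the pipeline: frames -> dynamic images -> predictions -> label.

    predictor:  hypothetical stand-in for PredNet; maps a stack of
                dynamic images (N, H, W, C) to the next one (H, W, C).
    classifier: hypothetical stand-in for ResNet; maps a stack of
                predicted dynamic images to an action label.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Step 1: one dynamic image per non-overlapping window of frames.
    history = [dynamic_image(frames[i:i + window])
               for i in range(0, len(frames) - window + 1, window)]
    # Step 2: predict `horizon` future dynamic images, feeding each
    # prediction back in as input (an assumed rollout scheme).
    predicted = []
    for _ in range(horizon):
        nxt = predictor(np.stack(history))
        predicted.append(nxt)
        history.append(nxt)
    # Step 3: classify the sequence of predicted dynamic images.
    return classifier(np.stack(predicted))

# Toy stand-ins so the sketch runs end to end.
repeat_last = lambda seq: seq[-1]
report_shape = lambda seq: seq.shape
demo = predict_future_action(np.zeros((8, 2, 2, 3)), window=4, horizon=3,
                             predictor=repeat_last, classifier=report_shape)
```

Passing the two models in as callables keeps the sketch independent of any deep learning framework; real PredNet and ResNet152 models would replace the toy stand-ins.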


  • Datasets are commonly split roughly 60%, 10%, and 30% for training, validation, and testing. The split in this paper is unusually skewed toward training: 40 videos (about 90%) for training, 2 (about 5%) for validation, and 2 (about 5%) for testing.
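The percentages for this split are rounded. A quick check of the arithmetic, assuming the 44 videos listed here make up the whole dataset (the helper name `split_percentages` is mine):

```python
def split_percentages(counts):
    """Return each split's share of the dataset as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

shares = split_percentages({"train": 40, "val": 2, "test": 2})
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
# train: 90.9%
# val: 4.5%
# test: 4.5%
```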


This paper presents a pipeline for future action prediction and shows that dynamic images can encode motion for subsequent action recognition.

I write reviews on computer vision papers. Writing tips are welcome.