Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment

Weakly supervised action segmentation learns to segment actions in long untrimmed videos using only the action transcript, i.e., the ordered list of actions, as training labels. During training, the network has access to the video features and the groundtruth sequence of actions, and it learns to predict a label for every frame. For example, if a video has N frames spanning four actions, the network outputs N predictions, one action class per frame, as shown in the following figure. Please note that this setting assumes a single action per frame, i.e., no overlapping actions. During inference, only video features are available and the network predicts the action for every frame.

This weakly supervised setting is important and gaining momentum for two reasons: (1) collecting full temporal annotation is expensive and tedious; (2) full temporal annotation is a challenging task for humans, because it is hard to maintain consistency across multiple annotators.

The method from this paper is not the current state-of-the-art (SOTA), but it is very simple to understand and implement. That’s why it is a good starting point for someone interested in this computer vision problem. The current SOTA method [2] employs a loss function based on dynamic time warping (dynamic programming), which makes the approach complex and computationally expensive to train. On the other hand, this method employs the typical softmax loss with a simple expectation-maximization (EM) style algorithm.

The proposed EM-style algorithm works as follows. Given a video and the groundtruth action sequence, the estimation (E) step assigns every frame to one action class. At the first E step, this assignment is a simple uniform mapping. For example, if the video has N=20 frames and two actions {take bowl, pour cereals}, then every 10 consecutive frames are assigned a single action. This assignment is probably wrong, especially near the action boundaries, so a soft assignment procedure is employed as shown in the following figure.
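The initial E step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the `soft_width` parameter, and the linear ramp used to blend the two neighbouring actions around a boundary are all assumptions made here for the example.

```python
import numpy as np

def uniform_soft_assignment(transcript, num_frames, num_classes, soft_width=2):
    """Map a transcript uniformly onto frames, then soften the boundaries.

    transcript: ordered list of class indices, e.g. [0, 1]
    soft_width: frames blended on each side of a boundary (assumed value)
    Returns a (num_frames, num_classes) array whose rows sum to 1.
    """
    # Hard uniform mapping: frame t falls into segment floor(t * S / N).
    seg_of_frame = (np.arange(num_frames) * len(transcript)) // num_frames
    hard_labels = np.asarray(transcript)[seg_of_frame]
    target = np.eye(num_classes)[hard_labels]

    # A boundary is the first frame of every segment after the first one.
    boundaries = np.flatnonzero(np.diff(seg_of_frame)) + 1
    for b in boundaries:
        left_cls, right_cls = hard_labels[b - 1], hard_labels[b]
        lo, hi = max(b - soft_width, 0), min(b + soft_width, num_frames)
        for t in range(lo, hi):
            # Weight of the right-hand action ramps linearly across the window
            # (the linear ramp is an assumption, not the paper's formula).
            w_right = (t - (b - soft_width) + 0.5) / (2 * soft_width)
            target[t] = 0.0
            target[t, left_cls] += 1.0 - w_right
            target[t, right_cls] += w_right
    return target

# The example from the text: N=20 frames, two actions {take bowl, pour cereals}.
pseudo_gt = uniform_soft_assignment([0, 1], num_frames=20, num_classes=2)
```

Frames far from the boundary keep a one-hot label, while the frames around frame 10 carry a mix of both actions.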

After this soft assignment (E-step), a neural network is trained on this pseudo groundtruth to learn the action of every frame, i.e., a maximization (M) step. After training the neural network for 100 epochs, another estimation (E) step revises the previous pseudo groundtruth using the learned network predictions on the training split before another maximization (M) step starts. This estimation maximization process repeats as shown in the next figure until a stopping criterion is reached.
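The alternation itself is a short loop. The sketch below only shows the control flow; `train_step`, `reestimate`, and `stopping_loss` are hypothetical callables standing in for the real training, re-estimation, and stopping-criterion code, and "stop as soon as the loss no longer decreases" is an assumed rule.

```python
def iterate_em(initial_targets, train_step, reestimate, stopping_loss, max_rounds=10):
    """Alternate M-steps (training) and E-steps (revising the pseudo groundtruth).

    Hypothetical placeholders:
      train_step(targets) -> model            M-step, e.g. 100 epochs of training
      reestimate(model, targets) -> targets   E-step on the training split
      stopping_loss(model) -> float           value of the stopping criterion
    """
    targets = initial_targets
    best_model, best_loss = None, float("inf")
    for _ in range(max_rounds):
        model = train_step(targets)            # M-step
        loss = stopping_loss(model)
        if loss >= best_loss:                  # no improvement: stop (assumed rule)
            break
        best_model, best_loss = model, loss
        targets = reestimate(model, targets)   # E-step
    return best_model, best_loss
```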

This idea has some limitations. Firstly, a degenerate initial estimation (pseudo groundtruth) can lead to degenerate training. Secondly, a careful choice of the stopping criterion is required to avoid overfitting on the training split. Please remember that the neural network is training on the pseudo assignment using a typical softmax loss. The proposed stopping criterion is a video-level recognition loss. Since the groundtruth action transcript is known, a global max-pooling through time is utilized to get the maximal probability of each action in a video. The next figure shows a numerical example for computing the stopping criterion on a single video.

In this figure, the video has N=6 frames and there are K=6 action classes. Every row holds the predictions for one frame, i.e., every row should sum to 1. To compute P’, max-pooling is applied through time to estimate the probability of every action in this video. Then, a binary cross-entropy loss is evaluated against the ground-truth occurrence of actions as shown in the next equation.

Iterative Soft Boundary Assignment stopping criterion.
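Written out from the surrounding description, a binary cross-entropy of this kind would take roughly the following form. This is a reconstruction rather than a copy of the paper's equation: y_k (1 if action k appears in the transcript, 0 otherwise) and P_{n,k} (the predicted probability of class k at frame n) are symbols introduced here, and the 1/K averaging is an assumption.

```latex
\mathcal{L}_{\text{video}}
  = -\frac{1}{K} \sum_{k=1}^{K}
    \Big[\, y_k \log P'_k + (1 - y_k) \log\big(1 - P'_k\big) \Big],
\qquad
P'_k = \max_{n = 1, \dots, N} P_{n,k}
```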

Where P’ is the result of max-pooling the network output (N, K) over time (N frames) and K is the number of action classes. This is the standard loss for a multi-label classification problem, where every video has L labels (L = number of distinct actions in the video). This loss makes sure that pooling the per-frame predictions over time results in the correct actions in the transcript.
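The computation is small enough to show in NumPy. This is a sketch under assumptions, not the authors' implementation: the function name, the `eps` clipping, and averaging the loss over the K classes are choices made for the example.

```python
import numpy as np

def video_level_loss(frame_probs, transcript, eps=1e-7):
    """Binary cross-entropy between max-pooled predictions and action occurrence.

    frame_probs: (N, K) per-frame class probabilities, each row sums to 1
    transcript:  iterable of class indices that occur in the video
    """
    num_classes = frame_probs.shape[1]
    pooled = frame_probs.max(axis=0)                  # P': max over time, shape (K,)
    occurs = np.zeros(num_classes)                    # multi-hot groundtruth
    occurs[np.array(sorted(set(transcript)), dtype=int)] = 1.0
    pooled = np.clip(pooled, eps, 1.0 - eps)          # avoid log(0)
    bce = -(occurs * np.log(pooled) + (1.0 - occurs) * np.log(1.0 - pooled))
    return bce.mean()                                 # mean over K is an assumption

# Two frames, three classes; classes 0 and 1 appear in the transcript.
probs = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.8, 0.0]])
loss = video_level_loss(probs, transcript=[0, 1])
```

The loss is low when every action in the transcript is predicted confidently in at least one frame and no absent action is, which is why it can be monitored without any frame-level labels.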

To train on long untrimmed videos, a network with long temporal modeling is required. The next figure shows the proposed architecture.

Structure overview of TCFPN. The proposed network extends the original ED-TCN by adding lateral connections between encoder and decoder.

There are three core components packed into this architecture:

  • Using temporal convolution (TC) layers to model the short temporal order between actions.
  • Using an encoder-decoder architecture to stretch the receptive field of the temporal convolution layers, a technical workaround to improve long temporal modeling.
  • Employing a feature pyramid network (FPN) to integrate low- and high-level predictions into the final network output.
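To make the data flow of these three components concrete, here is a toy forward pass in NumPy. It is not the authors' TCFPN: the weights are random and untrained, and the number of levels, the channel widths, nearest-neighbour upsampling, and averaging the per-level logits are all assumptions chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)  # random weights: this sketch is untrained

def temporal_conv(x, out_channels, kernel_size=3, relu=True):
    """1D convolution over time with 'same' padding. x: (T, C_in) -> (T, C_out)."""
    t, c_in = x.shape
    w = rng.standard_normal((kernel_size, c_in, out_channels)) * 0.1
    pad = kernel_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = sum(xp[k:k + t] @ w[k] for k in range(kernel_size))
    return np.maximum(out, 0.0) if relu else out

def downsample(x):
    """Max-pool over time with stride 2 (T must be even)."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1]).max(axis=1)

def upsample(x):
    """Nearest-neighbour upsampling over time by a factor of 2."""
    return np.repeat(x, 2, axis=0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tcfpn_like_forward(features, num_classes, channels=(16, 32)):
    """Encoder-decoder over time with lateral connections and pyramid outputs."""
    num_frames = features.shape[0]
    assert num_frames % (2 ** len(channels)) == 0, "pad the video first"

    # Encoder: convolve, remember the feature map, halve the temporal length.
    skips, x = [], features
    for c in channels:
        x = temporal_conv(x, c)
        skips.append(x)
        x = downsample(x)

    # Decoder: upsample, convolve, add the encoder map of equal length (lateral).
    level_logits = []
    for c, skip in zip(reversed(channels), reversed(skips)):
        x = temporal_conv(upsample(x), c) + skip
        logits = temporal_conv(x, num_classes, kernel_size=1, relu=False)
        # Bring every level's prediction back to full temporal resolution.
        level_logits.append(np.repeat(logits, num_frames // len(logits), axis=0))

    # Pyramid: merge coarse and fine predictions (averaging is an assumption).
    return softmax(np.mean(level_logits, axis=0))

frame_probs = tcfpn_like_forward(rng.standard_normal((8, 5)), num_classes=4)
```

Pooling in the encoder is what stretches the receptive field: a kernel of size 3 at the coarsest level already covers a much longer span of the original frames.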

While the proposed architecture can be improved with better components, its simplicity makes it very fast to train. The network takes the pre-computed Breakfast frame features from [6] as input and learns the pseudo groundtruth during each iteration.

The next table shows a quantitative evaluation of the proposed iterative approach on two datasets: Breakfast and Hollywood Extended.

The same network architecture can be used in a fully supervised setting where the groundtruth per-frame annotation is available during training. The next table presents a fully supervised quantitative evaluation.

The next figure presents a qualitative evaluation on two videos to emphasize how the pseudo groundtruth gets updated during consecutive training iterations. The pseudo groundtruth gets better during training (left), which in turn improves the segmentation quality during inference (right).

My Comments:

  • The paper’s code is available on GitHub; the authors deserve a clap :). I think it is important to promote code release in the research community. The code uses Keras on top of TensorFlow. It is straightforward to convert the code to pure TensorFlow and use it. The training time is small (~2 hours). The code is simple to understand and corresponds well with the paper formulation.
  • I prefer the formulation proposed in this paper over complex dynamic time warping (dynamic programming) approaches [2,3]. It is much simpler to understand, implement and train.
  • One thing I don’t like about this approach is the output format. The per-frame action prediction output can lead to a degenerate solution where an action spans a single frame. That’s why I prefer approaches similar to Souri et al. [4].
  • I wish the paper utilized a better long temporal modeling architecture, such as the dilated convolutions in [5].
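The single-frame degenerate case from the third comment is easy to spot by collapsing per-frame labels into segments. The helper names below are made up for this illustration and are not part of the paper's code.

```python
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start_frame, length) segments."""
    segments, start = [], 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, length))
        start += length
    return segments

def single_frame_segments(frame_labels):
    """Return the segments that last exactly one frame."""
    return [seg for seg in frames_to_segments(frame_labels) if seg[2] == 1]

# A per-frame output where "pour" flickers on for a single frame.
labels = ["take", "take", "pour", "take", "take"]
flickers = single_frame_segments(labels)
```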


[1] Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment

[2] D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation

[3] NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning

[4] Weakly Supervised Action Segmentation Using Mutual Consistency

[5] MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation

[6] The Language of Actions: Recovering the Syntax and Semantics of Goal-Directed Human Activities

I write reviews on computer vision papers. Writing tips are welcomed.
