Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment

Weakly supervised action segmentation learns to segment actions in long, untrimmed videos using only action transcripts as training labels. During training, the network has access to the video features and the ground-truth sequence of actions, but not to the frame-level boundaries between them. The network learns to predict a label for every frame. For example, if a video has N frames spanning four actions, the network outputs N predictions, one action class per frame, as shown in the following figure. Note that this setting assumes a single action per frame, i.e…
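
To make the relationship between a transcript and per-frame labels concrete, here is a minimal sketch. It is not the method of this project: the uniform split is only an assumed placeholder for the learned alignment, and the function names and action names are hypothetical. It shows that an ordered transcript of four actions expands to N frame labels, and that collapsing consecutive duplicates in those labels recovers the transcript.

```python
def uniform_frame_labels(transcript, num_frames):
    """Spread an ordered transcript evenly over num_frames frames.

    Assumption for illustration only: every action gets a roughly equal,
    contiguous share of the video. A trained network would predict the
    boundaries instead.
    """
    num_actions = len(transcript)
    if num_frames < num_actions:
        raise ValueError("need at least one frame per action")
    # Frame t falls into segment floor(t * K / N), which keeps the
    # segments contiguous and in transcript order.
    return [transcript[t * num_actions // num_frames] for t in range(num_frames)]


def collapse(frame_labels):
    """Merge runs of identical consecutive labels back into a transcript."""
    transcript = []
    for label in frame_labels:
        if not transcript or transcript[-1] != label:
            transcript.append(label)
    return transcript


# Hypothetical transcript with four actions and a video of N = 10 frames.
transcript = ["take_cup", "pour_coffee", "add_milk", "stir"]
frame_labels = uniform_frame_labels(transcript, num_frames=10)

print(len(frame_labels))       # one label per frame
print(collapse(frame_labels))  # the original ordered transcript
```

The single-action-per-frame assumption is what makes `collapse` well defined: each frame carries exactly one label, so the frame sequence maps to exactly one ordered transcript.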