Temporal Context Network for Activity Localization in Videos

This paper proposes a Temporal Context Network (TCN) for activity detection. The main contribution is showing that temporal context improves activity detection accuracy. Similar to Faster R-CNN, the pipeline is divided into three steps: proposal generation, classification, and boundary refinement. Before explaining these steps, let's first define a video activity as a video segment (b, e), where b and e denote the beginning and end of the segment. Every activity contains one or more actions or events; an event contains multiple actions.

To generate proposals, an untrimmed video is divided into M segments with 50% overlap, each containing L frames. For each segment, K = 20 proposals (b, e) are generated at different temporal scales, as shown in the figure.
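The proposal step can be sketched as follows. This is a minimal illustration, not the paper's exact scheme: the segment spacing of L/2 frames (50% overlap) follows the text, but the geometric growth of proposal lengths is a hypothetical choice.

```python
def generate_proposals(num_frames, L=128, K=20):
    """Sketch of multi-scale temporal proposal generation.

    Segment centers are placed every L/2 frames (50% overlap, as in
    the text); the K proposal lengths growing geometrically around
    each center are an assumed scheme for illustration.
    """
    centers = range(L // 2, num_frames, L // 2)
    lengths = [int(L * (1.25 ** k)) for k in range(K)]  # assumed scales
    proposals = []
    for c in centers:
        for length in lengths:
            b = max(0, c - length // 2)          # clip to video start
            e = min(num_frames, c + length // 2)  # clip to video end
            proposals.append((b, e))
    return proposals
```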

After proposal generation, a feature representation for ranking the proposals is required. The untrimmed video's frames are sampled so that m = T * 2 / fps frames are kept, where T is the number of frames, fps is the number of frames per second, and 2 is a hyper-parameter (i.e., two frames are sampled per second). The video is thus represented by a feature vector F = {f_1, f_2, ..., f_m}.
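The sampling above amounts to picking m uniformly spaced frame indices; a small sketch under that reading:

```python
def sample_frame_indices(T, fps, rate=2.0):
    """Uniformly sample frames at `rate` frames per second.

    Yields m = T * rate / fps indices; rate = 2 is the
    hyper-parameter from the text.
    """
    m = int(T * rate / fps)
    step = T / max(m, 1)                 # spacing between kept frames
    return [int(i * step) for i in range(m)]
```

For example, a 10-second clip at 30 fps (T = 300) yields m = 20 sampled frames.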

Using the video feature vector F, a proposal feature vector is constructed by sampling n frame features within the proposal segment (b, e). A proposal is represented by a feature vector Z_{i,k} = {z_1, z_2, ..., z_n}, where i is the proposal index and k is the temporal scale.
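A sketch of this step, assuming uniform sampling of n positions inside (b, e); the value n = 8 is a hypothetical choice, not taken from the paper:

```python
import numpy as np

def proposal_features(F, b, e, n=8):
    """Sample n feature vectors uniformly inside a proposal (b, e).

    F is the (m, d) matrix of per-frame features; b and e index
    into F. n = 8 is a hypothetical choice for illustration.
    """
    idx = np.linspace(b, e - 1, n).astype(int)  # n uniform positions
    return F[idx]                                # Z_{i,k}, shape (n, d)
```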

To detect an activity, a pair of proposal features Z_{i,k} and Z_{i,k+1} from two consecutive scales is fed to a temporal convolutional network, as shown below.

After the temporal convolutions, the proposal feature vectors are concatenated and fed to a fully connected layer to compute the activity detection loss. In parallel, a similar pipeline classifies the activity; unlike the detection pipeline, only one proposal is used for classification. This figure summarizes the whole network.
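The concatenate-then-score detection head can be sketched as below. This is a minimal linear sketch: Z_k and Z_k1 stand for the (flattened) proposal and context features after the temporal convolutions, and W, b_fc are hypothetical fully connected layer parameters.

```python
import numpy as np

def detection_score(Z_k, Z_k1, W, b_fc):
    """Sketch of the detection head: concatenate a proposal with its
    next-scale (context) features, then apply a fully connected
    layer to score the proposal. W and b_fc are hypothetical.
    """
    pair = np.concatenate([Z_k, Z_k1])  # context-aware representation
    return W @ pair + b_fc              # detection logits
```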


  • The network considers temporal proposals but not spatial proposals. This probably hinders its ability to detect actions in the background.
  • The authors note that classification is harder here: the network's classification accuracy is lower than its detection accuracy. The network processes untrimmed videos, which is probably the reason for such results.
  • It is not clear why a context proposal is not used for classification as it is for detection, given that the main idea of this paper is that context matters.

I write reviews on computer vision papers. Writing tips are welcomed.