Temporal Context Network for Activity Localization in Videos

This paper proposes a Temporal Context Network (TCN) for activity detection. The main contribution is showing that temporal context around an activity improves detection accuracy. Similar to Faster R-CNN, the pipeline is divided into three steps: proposal generation, classification, and boundary refinement. Before explaining these steps, let's first define a video activity as a video segment between (b, e), where b and e denote the beginning and end of the segment. Every activity contains one or more actions or events; an event contains multiple actions.

To generate proposals, an untrimmed video is divided into M segments with 50% overlap, each containing L frames. For each segment, K = 20 proposals (b, e) are generated at different temporal scales, as shown in the figure.
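A minimal Python sketch of this proposal generation step, assuming a geometric scale schedule centered on each segment; the function name and the exact scale progression are illustrative choices, not taken from the paper:

```python
# Sketch of multi-scale proposal generation over an untrimmed video.
# The scale schedule (segment_len * 2**(k/4)) is an assumption for illustration.

def generate_proposals(num_frames, segment_len, num_scales=20):
    """Return a list of (begin, end) frame indices.

    The video is split into segments of `segment_len` frames with 50% overlap;
    each segment center anchors `num_scales` proposals of increasing temporal
    extent, clipped to the video boundaries.
    """
    stride = segment_len // 2                         # 50% overlap
    proposals = []
    for start in range(0, max(num_frames - segment_len + 1, 1), stride):
        center = start + segment_len / 2.0
        for k in range(num_scales):
            length = segment_len * (2 ** (k / 4.0))   # assumed scale schedule
            b = max(0, int(center - length / 2))
            e = min(num_frames, int(center + length / 2))
            proposals.append((b, e))
    return proposals


if __name__ == "__main__":
    props = generate_proposals(num_frames=3000, segment_len=128)
    print(len(props), props[:3])
```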

After proposal generation, a feature representation is required for ranking the proposals. The frames of the untrimmed video are sampled at 2 frames per second, giving m = 2T / fps sampled frames, where T is the total number of frames, fps is the frame rate, and 2 is a hyperparameter (the sampling rate). The video is thus represented by a feature vector F = {f_1, f_2, ..., f_m}, with one feature per sampled frame.
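As a worked example, a 60-second clip at 30 fps has T = 1800 frames, so m = 2 * 1800 / 30 = 120 sampled frames. A small sketch of the sampling step follows; the 2048-dimensional placeholder feature stands in for whatever frame-level feature extractor the paper actually uses:

```python
import numpy as np

def sample_frame_indices(num_frames, fps, rate=2):
    """Indices of frames sampled at `rate` frames per second,
    giving m = rate * num_frames / fps sampled frames."""
    m = int(rate * num_frames / fps)
    return np.linspace(0, num_frames - 1, num=m).astype(int)

# Worked example: a 60 s clip at 30 fps (T = 1800) yields m = 120 sampled frames.
idx = sample_frame_indices(num_frames=1800, fps=30)
print(len(idx))          # 120

# F = {f_1, ..., f_m}: one feature per sampled frame (random placeholder values).
feature_dim = 2048       # assumed dimensionality of the frame-level feature
F = np.random.randn(len(idx), feature_dim)
```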

Using the video feature vector F, a proposal feature is constructed by sampling n frame features within the proposal segment (b, e). A proposal is thus represented by a feature vector Z_{i,k} = {z_1, z_2, ..., z_n}, where i is the proposal index and k is the temporal scale.
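A minimal sketch of building Z_{i,k} from F, assuming n = 16 uniformly spaced samples and that b and e index into the sampled feature sequence; both are illustrative assumptions:

```python
import numpy as np

def proposal_feature(F, b, e, n=16):
    """Z_{i,k} = {z_1, ..., z_n}: n frame features sampled uniformly
    inside the proposal interval (b, e), where F has shape (m, feature_dim)
    and b, e index the sampled feature sequence."""
    rows = np.linspace(b, e - 1, num=n).astype(int)
    return F[rows]                          # shape: (n, feature_dim)

# Example: m = 120 sampled frames with 2048-d placeholder features.
F = np.random.randn(120, 2048)
Z = proposal_feature(F, b=20, e=60)
print(Z.shape)                              # (16, 2048)
```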

To detect an activity, a pair of proposal features, Z_{i,k} and Z_{i,k+1}, from two consecutive temporal scales is fed to a temporal convolutional network, as shown below.
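A PyTorch sketch of such a temporal convolution; the kernel size, hidden width, max pooling, and weight sharing across the two scales are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TemporalConvBlock(nn.Module):
    """1-D convolution over the temporal axis of a proposal feature
    Z of shape (batch, n, feature_dim), followed by max pooling over time."""

    def __init__(self, feature_dim=2048, hidden_dim=512, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feature_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),       # pool over the n temporal positions
        )

    def forward(self, z):                  # z: (batch, n, feature_dim)
        return self.net(z.transpose(1, 2)).squeeze(-1)   # (batch, hidden_dim)

# The same block (shared weights, assumed) embeds both scales of the proposal pair.
tconv = TemporalConvBlock()
z_k, z_k1 = torch.randn(4, 16, 2048), torch.randn(4, 16, 2048)
h_k, h_k1 = tconv(z_k), tconv(z_k1)
print(h_k.shape, h_k1.shape)               # torch.Size([4, 512]) torch.Size([4, 512])
```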

After the temporal convolution, the two proposal feature vectors are concatenated and fed to a fully connected layer to compute the activity detection loss. In parallel, a similar pipeline classifies the activity; unlike the detection pipeline, it uses only a single proposal. The figure below summarizes the whole network.
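A sketch of the two parallel heads, taking the temporal-convolution outputs from the block above as input; the hidden width, the binary detection output, and the number of activity classes are placeholders:

```python
import torch
import torch.nn as nn

class DetectionAndClassificationHeads(nn.Module):
    """The detection head scores the concatenated pair of temporally convolved
    proposal features; the classification head labels the activity from a
    single proposal. Dimensions and class count are illustrative."""

    def __init__(self, hidden_dim=512, num_classes=200):
        super().__init__()
        self.detector = nn.Linear(2 * hidden_dim, 2)          # proposal vs. background
        self.classifier = nn.Linear(hidden_dim, num_classes)  # activity label

    def forward(self, h_k, h_k1):          # outputs of the temporal convolution
        det_logits = self.detector(torch.cat([h_k, h_k1], dim=1))
        cls_logits = self.classifier(h_k)  # classification uses one proposal only
        return det_logits, cls_logits

# Example with random embeddings standing in for the temporal-conv outputs.
heads = DetectionAndClassificationHeads()
det, cls = heads(torch.randn(4, 512), torch.randn(4, 512))
print(det.shape, cls.shape)                # torch.Size([4, 2]) torch.Size([4, 200])
```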

Comments:

I write reviews on computer vision papers. Writing tips are welcome.
