Slow and steady feature analysis: higher order temporal coherence in video

This paper proposes a novel regularizer for semi-supervised image embedding. Slow feature analysis (SFA) assumes that the feature representations of temporally close frames differ only slightly. Thus, the distance Dist 1 (Feature Vector 2 - Feature Vector 1) should be small, and SFA forces the feature representation to change slowly between nearby frames. To avoid the trivial solution, Feature Vector 1 = Feature Vector 2 = 0, a contrastive loss function is used. However, SFA lacks future predictability: Dist 1 and Dist 2 (Feature Vector 3 - Feature Vector 2) are both expected to be small, yet no relationship couples Dist 1 and Dist 2.
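
As a rough illustration of why a contrastive term rules out the all-zero solution, here is a minimal sketch of one common form of contrastive loss. It is my own sketch, not the paper's code: the function names, the squared Euclidean distance, and the margin value of 1.0 are all assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_sfa_loss(fv1, fv2, is_neighbor, margin=1.0):
    """Contrastive slowness loss for one pair of frames (hypothetical helper).

    Temporal neighbors are pulled together; non-neighbors are pushed
    at least `margin` apart. The margin value is an assumption.
    """
    d = euclidean(fv1, fv2)
    if is_neighbor:
        return d ** 2                      # small distance is rewarded
    return max(0.0, margin - d) ** 2       # collapsed pairs are penalized


# The trivial solution (all features zero) is free for neighbors ...
print(contrastive_sfa_loss([0.0, 0.0], [0.0, 0.0], is_neighbor=True))
# ... but costs margin**2 for every non-neighbor pair.
print(contrastive_sfa_loss([0.0, 0.0], [0.0, 0.0], is_neighbor=False))
```

With this form, mapping every frame to the same point keeps neighbor pairs cheap but makes every non-neighbor pair pay the full margin penalty, so the network cannot win by collapsing the embedding.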

The steady slow feature analysis (SSFA) proposed in this paper imposes such a predictability constraint. On top of Dist 1 and Dist 2 being small, Dist 1 should approximate Dist 2. This regularizer imposes second-order temporal coherence.
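
To make the difference between the two constraints concrete, the sketch below measures how far a triplet of feature vectors is from satisfying Dist 1 = Dist 2. The helper names are hypothetical and the toy 2-D vectors are made up for illustration; the paper's actual loss may be formulated differently.

```python
import math


def diff(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in u))


def steadiness_residual(fv1, fv2, fv3):
    """Norm of Dist 2 - Dist 1 for three consecutive frames (hypothetical helper)."""
    dist1 = diff(fv2, fv1)
    dist2 = diff(fv3, fv2)
    return norm(diff(dist2, dist1))


# Features moving along a straight line at constant speed: residual is zero.
print(steadiness_residual([0.0, 0.0], [1.0, 0.0], [2.0, 0.0]))

# Both steps have length 1 (equally "slow"), but the direction changes,
# so the second-order residual is non-zero.
print(steadiness_residual([0.0, 0.0], [1.0, 0.0], [1.0, 1.0]))
```

The second triplet is the case SFA alone cannot distinguish: both steps are equally small, yet the feature trajectory turns, which is exactly what the second-order term penalizes.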

The novel regularizer is evaluated on recognition and sequence completion tasks. In sequence completion, given two frames, their corresponding feature vectors FV1 and FV2 are computed. The third frame is predicted from a large pool of candidate images. Since SSFA encourages FV2 - FV1 (Dist 1) to approximately equal FV3 - FV2 (Dist 2), the predicted third frame is the candidate whose feature vector FV3 is closest to 2 * FV2 - FV1.

Given two query frames, a pool of candidate frames is sorted according to the SSFA constraint, and the top three candidate frames are returned.
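
The ranking step can be sketched as a nearest-neighbor search around the extrapolated feature 2 * FV2 - FV1. This is only my reading of the procedure: the function name, the use of Euclidean distance, and the toy candidate pool are assumptions.

```python
import math


def rank_candidates(fv1, fv2, candidates, top_k=3):
    """Return indices of the `top_k` candidates closest to 2 * FV2 - FV1.

    `candidates` is a list of feature vectors (hypothetical pool).
    """
    # Linear extrapolation of the feature trajectory: FV3 = 2 * FV2 - FV1.
    predicted = [2 * b - a for a, b in zip(fv1, fv2)]

    def distance_to_prediction(index):
        fv3 = candidates[index]
        return math.sqrt(sum((p - c) ** 2 for p, c in zip(predicted, fv3)))

    order = sorted(range(len(candidates)), key=distance_to_prediction)
    return order[:top_k]


# Toy pool of 2-D features; the extrapolated feature is [2.0, 2.0].
pool = [[5.0, 5.0], [2.0, 2.0], [2.5, 2.0], [0.0, 0.0], [2.0, 1.0]]
print(rank_candidates([0.0, 0.0], [1.0, 1.0], pool))
```

In practice the feature vectors would come from the trained network rather than being written by hand, and the pool would be far larger.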

In the recognition experiments, multiple loss functions are used to train a CNN. The following losses are evaluated: L1 is the purely supervised loss (SUP), L2 is SUP + SFA without the contrastive term, L3 is SUP + SFA with the contrastive term, and L4 is SUP + SFA + SSFA. L4, which uses the SSFA constraint, outperforms the other loss functions.

I write reviews on computer vision papers. Writing tips are welcome.