Unsupervised Learning of Spatiotemporally Coherent Metrics

Slow feature analysis (SFA) [1] assumes that adjacent video frames contain semantically similar information. Thus adjacent frames' features, z_t and z_{t-1}, should be close, i.e., the features change slowly over time, while temporally distant frames should have different features.

The original SFA loss function
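A minimal sketch of this slowness objective, assuming the features z_t come from some encoder (random vectors stand in for them here) and a simple squared-distance penalty between adjacent frames:

```python
import numpy as np

def slowness_loss(z_t, z_prev):
    """Penalize change between the features of adjacent frames.

    Minimizing this alone pushes features to change slowly over time,
    but it is trivially minimized by a constant encoder.
    """
    return np.sum((z_t - z_prev) ** 2)

# Features of two adjacent frames (stand-ins for encoder outputs).
z_prev = np.array([0.5, 0.5, 1.0])
z_t = np.array([0.5, 0.4, 0.9])
print(slowness_loss(z_t, z_prev))  # small value: slowly changing features
```

Note that a constant encoder drives this loss to exactly zero, which is the degenerate solution discussed next.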

While intuitive, the originally proposed SFA lacks two explicit constraints: (1) how to avoid the trivial constant (degenerate) solution, and (2) how to promote discriminative feature learning for downstream use. The trivial constant problem means the network predicts the same feature vector z for every input/frame, i.e., z = NN(x) = constant for all x. To avoid the trivial constant solution, Hadsell et al. [2] propose a slightly different loss function

Improved SFA loss function that avoids the constant solution

By sampling both similar and dissimilar input pairs, this loss function avoids the constant solution. However, avoiding the constant solution is necessary but not sufficient for the learned features to be useful: if the features are not discriminative, they are of little value for downstream tasks.
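The contrastive formulation of [2] can be sketched as follows, under the usual margin-based form: similar pairs (y = 1) are pulled together, while dissimilar pairs (y = 0) are pushed at least a margin m apart. The margin value here is an illustrative default, not taken from the paper.

```python
import numpy as np

def contrastive_loss(z1, z2, y, margin=1.0):
    """Margin-based contrastive loss over a pair of feature vectors.

    y = 1: similar pair, minimize the distance between features.
    y = 0: dissimilar pair, penalize distances smaller than the margin.
    """
    d = np.linalg.norm(z1 - z2)
    if y == 1:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

This is why the constant solution is no longer a minimizer: a collapsed encoder gives d = 0 for every pair, so each dissimilar pair contributes 0.5 * margin^2 to the loss.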

In this paper, Goroshin et al. [3] propose an auto-encoder network for SFA. Besides the original SFA slowness term, the learned feature vectors, h_t and h_{t-1} in the diagram, are used to reconstruct the original inputs. This new constraint promotes discriminative features and avoids the constant solution. The pooling layers impose spatiotemporal invariance. Besides the pooling and the decoder, [3] adds a regularizer to the loss function to promote sparse feature representations.

The final proposed loss function
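Under my reading of [3], the combined objective can be sketched as reconstruction for both frames plus an L1 slowness term and an L1 sparsity regularizer. The weight names alpha and beta and the stand-in linear decoder are my assumptions for illustration; the actual model uses learned convolutional layers with pooling.

```python
import numpy as np

def total_loss(x_t, x_prev, h_t, h_prev, decode, alpha=0.5, beta=0.1):
    """Sketch of the combined auto-encoder SFA objective.

    recon  : reconstruction error for both frames (avoids collapse,
             promotes discriminative features)
    slow   : L1 temporal coherence between adjacent features
    sparse : L1 sparsity regularizer on the features
    alpha, beta are illustrative weights, not values from the paper.
    """
    recon = (np.sum((decode(h_t) - x_t) ** 2)
             + np.sum((decode(h_prev) - x_prev) ** 2))
    slow = np.sum(np.abs(h_t - h_prev))
    sparse = np.sum(np.abs(h_t)) + np.sum(np.abs(h_prev))
    return recon + alpha * slow + beta * sparse
```

The reconstruction term is what rules out the degenerate solution here: a constant h cannot reconstruct varying inputs, so the encoder is forced to keep discriminative information while the slowness and sparsity terms shape the representation.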


[1] Slow feature analysis: Unsupervised learning of invariances

[2] Dimensionality Reduction by Learning an Invariant Mapping

[3] Unsupervised Learning of Spatiotemporally Coherent Metrics

I write reviews on computer vision papers. Writing tips are welcome.