Unsupervised Learning of Spatiotemporally Coherent Metrics
Slow feature analysis (SFA) [1] assumes that adjacent video frames contain semantically similar information. The features of adjacent frames, z_t and z_{t-1}, should therefore be close, i.e., the features change slowly over time. Dissimilar frames, in contrast, should have different features.
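The slowness idea can be written as a penalty on how much the features move between consecutive frames. The following is a minimal sketch, not the formulation from [1]: the function name slowness_loss, the squared Euclidean distance, and the toy feature sequences are all assumptions made for illustration.

```python
import numpy as np

def slowness_loss(z):
    """Mean squared distance between features of consecutive frames.

    z: array of shape (T, D), one D-dimensional feature vector per frame.
    """
    z = np.asarray(z, dtype=float)
    diffs = z[1:] - z[:-1]  # z_t - z_{t-1} for t = 1..T-1
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

# Toy sequences (hypothetical data): one drifts slowly, one changes quickly.
slow = np.cumsum(np.full((5, 2), 0.1), axis=0)
fast = np.cumsum(np.full((5, 2), 1.0), axis=0)

slow_score = slowness_loss(slow)  # small: features barely move per step
fast_score = slowness_loss(fast)  # large: features jump at every step
```

A feature extractor trained to minimize this quantity is pushed toward the slowly varying sequence.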
While intuitive, the slowness objective by itself leaves two issues unaddressed: (1) how to avoid the trivial constant (degenerate) solution, and (2) how to promote discriminative features for downstream use. The trivial constant solution means that the network predicts the same feature vector z for every input frame, i.e., z = NN(x) = constant for all x. To avoid it, Hadsell et al. [2] propose a slightly different loss function, a contrastive loss defined over pairs of inputs.
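The loss itself is not reproduced in this text. One commonly used form of such a pairwise loss is sketched below, assuming z = NN(x), a distance given by a vector norm, and a margin m > 0. The symbols m and t' are introduced here for illustration, and the original formulation in [2] uses squared terms rather than the plain distances shown.

```latex
L(x_t, x_{t'}) =
\begin{cases}
\lVert z_t - z_{t'} \rVert, & \text{if } x_t \text{ and } x_{t'} \text{ form a similar pair (e.g. adjacent frames)} \\
\max\bigl(0,\; m - \lVert z_t - z_{t'} \rVert\bigr), & \text{otherwise (dissimilar pair)}
\end{cases}
```

The first case pulls the features of similar inputs together. The second pushes the features of dissimilar inputs apart until they are at least m away from each other.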
By sampling both similar and dissimilar input pairs, this loss function penalizes the constant solution. Avoiding the constant solution is necessary but not sufficient, however: features that are not discriminative are of little use for downstream tasks.
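To see why sampling dissimilar pairs rules out the constant solution, consider a small numerical sketch of a margin-based pairwise loss. The function name contrastive_loss, the Euclidean distance, and the default margin of 1.0 are assumptions for illustration and are not taken from [2].

```python
import numpy as np

def contrastive_loss(z_a, z_b, similar, margin=1.0):
    """Pairwise loss: pull similar pairs together, push dissimilar pairs apart."""
    d = float(np.linalg.norm(np.asarray(z_a, dtype=float) - np.asarray(z_b, dtype=float)))
    if similar:
        return d                      # similar pair: penalize any distance
    return max(0.0, margin - d)       # dissimilar pair: penalize distance below the margin

# A degenerate encoder maps every input to the same vector.
constant = np.zeros(4)
loss_similar = contrastive_loss(constant, constant, similar=True)      # 0.0
loss_dissimilar = contrastive_loss(constant, constant, similar=False)  # 1.0, the full margin
```

The constant solution is free on similar pairs but pays the full margin on every dissimilar pair, so it no longer minimizes the loss.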
In this paper, Goroshin et al. [3] propose an auto-encoder network for SFA. In addition to the original SFA loss, the learned feature vectors, h_t and h_{t-1} in the diagram, are used to reconstruct the original input. This new constraint promotes discriminative features and avoids the constant solution. The pooling layers impose spatiotemporal invariance. Besides the pooling and the decoder, [3] adds a regularizer to the loss function to promote sparse feature representations.
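The three ingredients described above (reconstruction, slowness, and sparsity) can be combined into a single objective. The following is a rough sketch under stated assumptions, not the exact architecture or loss of [3]: the linear encoder with ReLU, the linear decoder, L2 pooling over groups of two units, applying the slowness term to the pooled features, the L1 sparsity penalty, and the weights alpha and beta are all illustrative choices.

```python
import numpy as np

def l2_pool(h, group=2):
    """L2-pool a feature vector over non-overlapping groups of units.

    Assumes len(h) is divisible by `group`.
    """
    return np.sqrt(np.sum(h.reshape(-1, group) ** 2, axis=1))

def sfa_autoencoder_loss(x_prev, x_curr, W_enc, W_dec, alpha=0.1, beta=1.0):
    """Reconstruction + L1 sparsity on h, plus slowness on pooled features."""
    total = 0.0
    pooled = []
    for x in (x_prev, x_curr):
        h = np.maximum(0.0, W_enc @ x)       # encoder: linear map followed by ReLU
        x_hat = W_dec @ h                    # decoder: linear reconstruction
        total += np.sum((x_hat - x) ** 2)    # reconstruction error
        total += alpha * np.sum(np.abs(h))   # sparsity regularizer
        pooled.append(l2_pool(h))
    # Slowness: pooled features of adjacent frames should be close.
    total += beta * np.sum(np.abs(pooled[1] - pooled[0]))
    return float(total)

# Two adjacent "frames" (hypothetical data) and two encoders to compare.
x_prev = np.array([1.0, 2.0, 0.0, 1.0])
x_curr = np.array([1.0, 2.0, 0.5, 1.0])
identity = np.eye(4)
degenerate = np.zeros((4, 4))  # maps every input to the same (zero) feature vector

loss_identity = sfa_autoencoder_loss(x_prev, x_curr, identity, identity)
loss_degenerate = sfa_autoencoder_loss(x_prev, x_curr, degenerate, identity)
```

In this sketch the degenerate encoder has zero slowness and sparsity cost but cannot reconstruct its inputs, so the reconstruction term is what penalizes the constant solution.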
References
[1] Slow feature analysis: Unsupervised learning of invariances
[2] Dimensionality Reduction by Learning an Invariant Mapping
[3] Unsupervised Learning of Spatiotemporally Coherent Metrics