ImageNet is a great source of annotated images. Its large annotated corpus is used daily to train neural networks. Yet annotating data at that scale is infeasible for other problems such as image segmentation, object detection or action recognition. Medical data is another example: it requires medical technicians for annotation, which can be expensive. This paper proposes an unsupervised approach for object detection, classification and segmentation.
Sequential video frames are rich in information. Spatio-temporal relationships can be drawn without explicit annotation of the objects in these frames. Shuffle & Learn, O3N and OPN use the temporal order of frames to train neural networks for action recognition. This paper instead applies unsupervised motion-based segmentation to videos to obtain segments, which the authors use as ‘pseudo ground truth’ to train a convolutional network to segment objects from a single frame.
Low-level cues like edges, color and texture can lead to incorrect pixel grouping. Motion helps correctly group pixels that move together and identify this group as a single object. Thus, using a couple of frames, motion cues can segment objects in videos without any supervision. This pseudo segmentation is used as labels to train the ConvNet in this paper.
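To make the idea of grouping pixels by motion concrete, here is a minimal sketch. It is not the paper's segmentation algorithm: it assumes a dense optical flow field is already available as a NumPy array, and it simply thresholds the flow magnitude. The function name `pseudo_mask_from_flow` and the `rel_threshold` parameter are hypothetical, chosen only for illustration.

```python
import numpy as np

def pseudo_mask_from_flow(flow, rel_threshold=0.5):
    """Group pixels that move together into one foreground mask.

    flow: (H, W, 2) array of per-pixel displacement between two frames.
    Assumption: a simple magnitude threshold stands in for a real
    unsupervised motion segmentation method.
    """
    # Subtract the median flow as a crude estimate of background motion.
    residual = flow - np.median(flow.reshape(-1, 2), axis=0)
    magnitude = np.linalg.norm(residual, axis=-1)
    peak = magnitude.max()
    if peak == 0:
        # Nothing moves relative to the background: empty mask.
        return np.zeros(magnitude.shape, dtype=bool)
    return magnitude > rel_threshold * peak

# Synthetic example: a 2x3 block of pixels moves 3 px to the right.
flow = np.zeros((6, 8, 2))
flow[2:4, 3:6] = [3.0, 0.0]
mask = pseudo_mask_from_flow(flow)
```

The resulting boolean mask plays the role of the pseudo ground truth: it can be paired with a single frame as a training target, with no human annotation involved.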
The pseudo segmentation is noisy, but the paper argues that this noise has only a minor effect on the network's performance. To support this hypothesis, a ConvNet is first trained with ground-truth segmentations from the COCO dataset, then trained again with systematically degraded ground truth to measure the difference in performance. To degrade the ground-truth segmentation, two kinds of noise are introduced: boundary noise, produced with morphological kernels, and truncation, as shown in the figure below.
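The two kinds of degradation can be sketched as follows. This is an illustrative approximation, not the paper's exact procedure: the 3x3 kernel, the number of iterations and the choice to truncate from the right side of the bounding box are all assumptions, and the helper names `dilate`, `erode` and `truncate` are hypothetical.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square kernel, built from array shifts."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine shifted copies of the mask.
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def erode(mask, iterations=1):
    """Binary erosion, expressed as dilation of the complement."""
    return ~dilate(~mask.astype(bool), iterations)

def truncate(mask, fraction=0.3):
    """Remove the right-most `fraction` of the mask's bounding box."""
    out = mask.astype(bool).copy()
    cols = np.flatnonzero(out.any(axis=0))
    if cols.size == 0:
        return out
    width = cols[-1] - cols[0] + 1
    cut = int(round(fraction * width))
    if cut > 0:
        out[:, cols[-1] - cut + 1:] = False
    return out

# Clean mask: a 4x6 rectangle inside a 10x10 image.
clean = np.zeros((10, 10), dtype=bool)
clean[3:7, 2:8] = True

grown = dilate(clean)           # boundary noise: mask too large
shrunk = erode(clean)           # boundary noise: mask too small
cropped = truncate(clean, 0.5)  # truncation: half of the object missing
```

Training on `grown`, `shrunk` or `cropped` instead of `clean` mimics the kind of errors a motion-based segmenter is likely to make, which is what lets the robustness to label noise be measured in a controlled way.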
The noisy segmentation has only a small effect on the network's mean average precision, as shown in the figure below.
Example network outputs are presented below. Even when trained with noisy segmentation labels (second column), the network produces an output (third column) that is better than those labels.
The proposed ConvNet is evaluated on multiple tasks: object detection, image classification, action classification and semantic segmentation. It is superior to other unsupervised approaches. Yet there is a clear gap between the proposed unsupervised ConvNet and supervised approaches that use ImageNet. The paper attributes this gap to the training dataset being small compared to ImageNet: although a video provides a lot of frames to train from, these frames are highly correlated.
The proposed approach is definitely interesting and useful, yet its effectiveness is unclear when the camera is moving, since every pixel in the image then appears to move. Can it be used with ego-motion videos, where the main character is invisible but multiple agents move independently?