This paper proposes PixelPlayer, a system to ground audio inside a video (frames) without manual supervision. Given an input video, PixelPlayer separates the accompanying audio into components and spatially localizes these components in the video. PixelPlayer enables us to listen to the sound originating from each pixel in the video as shown in the next Figure.
Training PixelPlayer's networks requires a dataset. The authors introduce a musical instrument video dataset for the proposed task, called the MUSIC (Multimodal Sources of Instrument Combinations) dataset. It is crawled from YouTube and has no manual annotation. The MUSIC dataset has 714 untrimmed videos of musical solos and duets. The next Figure shows the dataset statistics. In the dataset’s videos, the source-pixels of audio are not manually labeled. PixelPlayer is trained to learn these source-pixels with a self-supervision trick.
PixelPlayer has three main networks: (1) Video analysis network, (2) Audio analysis network, and (3) Audio synthesizer network. For each network, I will illustrate the testing setup, then highlight what is different during training. It is important to note that the inputs to the audio-synthesizer network have different dimensions during testing and training.
Video Analysis Network: Given a video, three frames are sampled. A ResNet model extracts per-frame features of size T×(H/16)×(W/16)×K, where T=3 is the number of frames, H and W are the frame height and width, and K is the number of output channels. The video-analysis network temporally pools the ResNet features and outputs a 3D tensor with (H/16)×(W/16)×K dimensions as shown in the next Figure. During testing, this 3D tensor is fed directly into the Audio synthesizer network. During training, however, an extra spatial pooling is applied, so the 3D tensor collapses into a 1D vector with K dimensions.
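The two pooling regimes can be sketched in NumPy. This is a minimal sketch rather than the authors' code: I assume max pooling for both steps (this article does not say which pooling is used), random numbers stand in for the ResNet features, and the helper name `pool_video_features`, the 224×224 frame size, and K=16 are hypothetical choices for illustration.

```python
import numpy as np

def pool_video_features(features, spatial=False):
    """Pool per-frame features of shape (T, H/16, W/16, K).

    Temporal pooling always runs; spatial pooling is the extra
    training-time step. Max pooling is an assumption of this sketch.
    """
    pooled = features.max(axis=0)             # (H/16, W/16, K)
    if spatial:
        pooled = pooled.max(axis=(0, 1))      # (K,)
    return pooled

rng = np.random.default_rng(0)
T, H, W, K = 3, 224, 224, 16                  # hypothetical sizes
frame_features = rng.random((T, H // 16, W // 16, K))  # stand-in for ResNet output

test_features = pool_video_features(frame_features)
train_features = pool_video_features(frame_features, spatial=True)

print(test_features.shape)    # (14, 14, 16)
print(train_features.shape)   # (16,)
```

At test time every one of the 14×14 spatial cells keeps its own K-dimensional feature, which is what later allows a separate sound per pixel.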
Audio Analysis Network: Given an input audio file (1D), it is converted into a spectrogram (2D). Then, the spectrogram is fed into an audio U-Net to split the audio signal into K components (3D) as shown in the next Figure. During testing, the input audio (S) comes from a single video. However, during training, the input audio (S) combines audio signals from different videos to generate a complex audio input signal.
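To make the 1D-to-2D step concrete, here is a minimal sketch of turning a waveform into a magnitude spectrogram with a hand-rolled STFT, and of building a training-style mixture by adding two waveforms. The `stft` helper, the window and hop sizes, the 8 kHz sample rate, and the two sine tones standing in for instruments are all my assumptions; the U-Net that splits the spectrogram into K components is not sketched.

```python
import numpy as np

def stft(signal, n_fft=256, hop=64):
    """Short-time Fourier transform; returns (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([
        signal[i * hop : i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1).T

sample_rate = 8000                          # hypothetical
t = np.arange(sample_rate) / sample_rate    # one second of audio
s1 = np.sin(2 * np.pi * 440 * t)            # stand-in for instrument 1
s2 = np.sin(2 * np.pi * 1000 * t)           # stand-in for instrument 2

mixture = s1 + s2                           # training-style mixed input (1D)
spectrogram = np.abs(stft(mixture))         # frequency x time (2D)

print(mixture.shape)       # (8000,)
print(spectrogram.shape)   # (129, 122)
```

The complex STFT is linear, so the STFT of the mixture equals the sum of the individual STFTs; the magnitude spectrograms are only approximately additive.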
Audio Synthesizer Network: Given the video-analysis and audio-analysis networks’ outputs, the audio-synthesizer learns a mask to be applied to the input spectrogram. The mask selects the spectral components associated with each pixel (video-analysis output). Finally, inverse STFT is applied to the masked spectrogram, corresponding to each pixel, to produce the final sound as shown in the next figure.
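One way to picture the synthesizer is as a per-pixel weighted sum of the K audio components, squashed into a mask. This is a sketch under assumptions, not a statement of the exact architecture: the sigmoid, the helper name `synthesize_masks`, the tensor sizes, and the random inputs are mine, and the final inverse STFT is omitted.

```python
import numpy as np

def synthesize_masks(visual_features, audio_components):
    """Combine visual features (..., K) with audio components (K, F, T).

    Returns masks of shape (..., F, T): one mask per pixel at test time,
    or a single mask when given a pooled K-vector (training time).
    The sigmoid-of-weighted-sum form is an assumption of this sketch.
    """
    logits = np.einsum("...k,kft->...ft", visual_features, audio_components)
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
Hp, Wp, K, F, T = 14, 14, 16, 32, 8                  # hypothetical sizes
pixel_features = rng.random((Hp, Wp, K))             # testing: 3D video feature
pooled_feature = pixel_features.max(axis=(0, 1))     # training: 1D video feature
audio_components = rng.standard_normal((K, F, T))    # stand-in for U-Net output
spectrogram = rng.random((F, T))                     # stand-in input spectrogram

pixel_masks = synthesize_masks(pixel_features, audio_components)
video_mask = synthesize_masks(pooled_feature, audio_components)

# Broadcasting applies each pixel's mask to the same input spectrogram.
pixel_spectrograms = pixel_masks * spectrogram

print(pixel_masks.shape)         # (14, 14, 32, 8)
print(video_mask.shape)          # (32, 8)
print(pixel_spectrograms.shape)  # (14, 14, 32, 8)
```

Because the K axis is contracted in both cases, the same combination step accepts the 1D training feature and the 3D testing feature in this sketch.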
The next figure shows the three networks and highlights the main differences between the testing and training setups. PixelPlayer is trained to separate the combined audio signal (S1 + S2) back into two independent signals (S1, S2), where S1 and S2 are the audio signals of the first and second video, respectively. Because the mixture is built from known signals, the separation targets come for free; this self-supervised trick is how PixelPlayer learns to ground audio inside a video (frames/pixels) without manual supervision.
The next figure depicts the process of mixing two audio signals into a single spectrogram and then using the binary output mask to separate the audio signal.
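The mix-and-separate idea can be illustrated on toy magnitude spectrograms. This is a minimal sketch under assumptions: the two sources occupy disjoint frequency bins so that a binary mask recovers them exactly, and I define the target mask as "source 1 is at least as loud as source 2 in this bin", which is my choice of definition rather than something stated in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 6, 4                                   # hypothetical spectrogram size

# Toy magnitude spectrograms: source 1 lives in the low bins, source 2 in the high bins.
mag1 = np.zeros((F, T))
mag1[:3] = rng.random((3, T)) + 0.5
mag2 = np.zeros((F, T))
mag2[3:] = rng.random((3, T)) + 0.5

mixture = mag1 + mag2                         # mixed spectrogram (approximately additive)

# Binary target masks: which source dominates each time-frequency bin.
mask1 = mag1 >= mag2
mask2 = ~mask1

# Separation: apply each mask to the mixture.
est1 = mask1 * mixture
est2 = mask2 * mixture

print(np.allclose(est1, mag1), np.allclose(est2, mag2))   # True True
```

Real instruments overlap in frequency, so with a binary mask the recovery is only approximate; the disjoint toy case just shows the mechanics.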
PixelPlayer can answer two questions given a joint audio-video signal: (1) Which pixels are making sounds? and (2) What sounds do these pixels make? The first Figure in this article shows how PixelPlayer answers the first question. The next Figure shows how PixelPlayer answers the second question.
This article presents qualitative evaluations only. For quantitative evaluation of the output audio signal, please refer to the paper. The following video presents further qualitative results together with their output audio.
- This paper requires an elementary audio-processing background (e.g., spectrograms). I am interested in unsupervised/self-supervised approaches but have limited experience working with audio, so this paper provides a smooth introduction to unsupervised approaches in the joint audio-video domain.
- The paper is a bit confusing due to the differences between the training and testing setups. The audio-synthesizer network takes different input dimensions during training (1D video feature) and testing (3D video feature)! So, I am surprised the audio-synthesizer network works as presented.
- The aggressive use of temporal and spatial pooling for the video features seems to work because the videos have static scenes. I don’t think PixelPlayer generalizes to more complex videos. On the other hand, this 2018 paper is an early work in the joint audio-video domain. I am sure there is more advanced follow-up literature.
- The main thing I like about this paper is the idea of mixing and separating audio files. The authors leverage the fact that audio signals are approximately additive. I wonder whether it is possible to apply the mix-and-separate idea to images/videos.