The Sound of Pixels

“Which pixels are making sounds?” Energy distribution of sound in pixel space. Overlaid heatmaps show the sound volume at each pixel.
Dataset statistics: (a) the distribution of video categories: 565 solo videos and 149 duet videos; (b) the distribution of video durations: the average duration is about 2 minutes.
The video-analysis network samples three frames from a video and generates a 3D tensor output (green cuboid). During training, however, an extra spatial pooling is applied to the 3D tensor to convert it into a 1D vector.
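
To make the caption concrete, here is a minimal PyTorch sketch of how I picture the video-analysis path. The truncated ResNet-18 backbone, the number of feature channels K, and the exact pooling operators are my assumptions based on the description above, not the authors' code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VideoAnalysis(nn.Module):
    """Per-frame CNN features, pooled over time (and, during training, over space)."""
    def __init__(self, k_channels=16):  # K=16 is an assumed value
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Drop the final avgpool/fc so we keep a spatial feature map per frame.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.project = nn.Conv2d(512, k_channels, kernel_size=1)

    def forward(self, frames, spatial_pool=True):
        # frames: (B, T, 3, H, W) -- T sampled frames per video (T=3 in the figure)
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.view(b * t, c, h, w))       # (B*T, 512, h', w')
        feats = self.project(feats)                               # (B*T, K, h', w')
        feats = feats.view(b, t, *feats.shape[1:]).max(dim=1).values  # temporal max-pool -> (B, K, h', w')
        if spatial_pool:
            # Training mode: collapse space to a single K-dim vector per video.
            feats = feats.mean(dim=(2, 3))                        # (B, K)
        return feats  # 3D feature map at test time, 1D vector during training
```
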
The audio-analysis network uses the Short-Time Fourier Transform (STFT) to convert the 1D input audio waveform into a 2D spectrogram. An audio U-Net then splits the spectrogram into K audio channels.
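
For readers new to audio processing, the waveform-to-spectrogram step looks roughly like the snippet below. The file name, sampling rate, window size, and hop length are my assumptions, not values taken from the paper.

```python
import numpy as np
import librosa

# Load a mono waveform; "violin.wav" and 11025 Hz are placeholders.
wav, sr = librosa.load("violin.wav", sr=11025, mono=True)

# Short-Time Fourier Transform: 1D waveform -> 2D complex spectrogram.
stft = librosa.stft(wav, n_fft=1022, hop_length=256)   # (freq_bins, time_frames), complex
magnitude = np.abs(stft)        # what the audio U-Net consumes and splits into K channels
phase = np.angle(stft)          # kept aside to invert the masked spectrogram later
print(magnitude.shape)
```
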
The audio-synthesizer network takes two inputs: the outputs of the video-analysis and audio-analysis networks. It learns a spectrogram mask that assigns a sound to each pixel. Finally, an inverse STFT generates an output audio signal per mask.
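
Here is a sketch of how I read the synthesizer step: each pixel's K-dim visual feature weights the K audio channels, a sigmoid turns the result into a soft spectrogram mask, and the masked mixture is inverted with the mixture's phase. Treat the variable names, the sigmoid, and the phase-reuse detail as assumptions.

```python
import numpy as np
import torch
import librosa

def synthesize_per_pixel(audio_channels, pixel_feature, mix_mag, mix_phase, hop_length=256):
    """audio_channels: (K, F, T) spectrogram channels from the audio U-Net.
    pixel_feature:  (K,) visual feature of one pixel from the video-analysis network.
    mix_mag/phase:  (F, T) magnitude and phase of the input mixture spectrogram (NumPy arrays)."""
    # Weighted sum over the K channels, then a sigmoid -> soft spectrogram mask in [0, 1].
    mask = torch.sigmoid((pixel_feature[:, None, None] * audio_channels).sum(dim=0))  # (F, T)
    # Apply the mask to the mixture magnitude and reuse the mixture phase.
    est_mag = mask.detach().numpy() * mix_mag
    est_stft = est_mag * np.exp(1j * mix_phase)
    # Inverse STFT gives the waveform attributed to this pixel.
    return librosa.istft(est_stft, hop_length=hop_length)
```
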
PixelPlayer is trained to separate the combined audio signal (S1 + S2) back into two independent signals (S1, S2), where S1 and S2 are the audio signals of the first and second video, respectively.
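A schematic of this mix-and-separate objective is sketched below. The dominant-source binary-mask target and the binary cross-entropy loss follow my reading of the setup, and `stft_fn`/`model` are hypothetical callables standing in for the networks above.

```python
import torch
import torch.nn.functional as F

def mix_and_separate_step(wav1, wav2, stft_fn, model):
    """wav1/wav2: waveforms from two different solo videos (self-supervised labels for free).
    stft_fn: returns a magnitude spectrogram; model: predicts one mask per video from the mixture."""
    mix = wav1 + wav2                       # audio signals are (approximately) additive
    mag_mix = stft_fn(mix)
    mag1, mag2 = stft_fn(wav1), stft_fn(wav2)

    # Target binary masks: which source dominates each time-frequency bin.
    target1 = (mag1 >= mag2).float()
    target2 = 1.0 - target1

    # The model predicts one mask per video, conditioned on that video's pixels/frames.
    pred1, pred2 = model(mag_mix)           # each (F, T), values in [0, 1]
    loss = F.binary_cross_entropy(pred1, target1) + F.binary_cross_entropy(pred2, target2)
    return loss
```
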
Qualitative results of vision-guided source separation on synthetic audio mixtures. This figure uses the training setup, not the testing setup. This experiment is performed only for quantitative model evaluation, since synthetic mixtures come with ground-truth sources.
“What sounds do these pixels make?” Clustering of sound in space. The overlaid colormap shows different audio features in different colors.
  • This paper requires an elementary audio-processing background (e.g., spectrograms). I am interested in unsupervised/self-supervised approaches but have limited experience working with audio, so this paper provides a smooth introduction to unsupervised approaches in the joint audio-video domain.
  • The paper is a bit confusing due to the differences between the training and testing setups. The audio-synthesizer network takes different input dimensions during training (1D video feature) and testing (3D video feature)! So, I am surprised the audio-synthesizer network works as presented.
  • The aggressive use of temporal and spatial pooling on the video features seems to work because the videos have static scenes. I don’t think PixelPlayer generalizes to more complex videos. On the other hand, this 2018 paper is an early exploration of the joint audio-video domain; I am sure there is more advanced follow-up literature.
  • The main thing I like about this paper is the idea of mixing and separating audio files. The authors leverage the fact that audio signals are approximately additive (a tiny numerical check follows this list). I wonder: is it possible to apply the mix-and-separate idea to images/videos?
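
On the additivity remark: waveforms add exactly, while spectrogram magnitudes add only approximately because phase is discarded. A tiny NumPy check of this, with placeholder file names and assumed STFT parameters:

```python
import numpy as np
import librosa

s1, sr = librosa.load("violin.wav", sr=11025)   # placeholder file names
s2, _  = librosa.load("guitar.wav", sr=11025)
n = min(len(s1), len(s2))
s1, s2 = s1[:n], s2[:n]

mix = s1 + s2                                    # waveforms are exactly additive
mag = lambda x: np.abs(librosa.stft(x, n_fft=1022, hop_length=256))
err = np.mean(np.abs(mag(mix) - (mag(s1) + mag(s2)))) / np.mean(mag(mix))
print(f"relative magnitude error: {err:.3f}")    # small but nonzero: magnitudes are only ~additive
```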

Ahmed Taha

I write reviews on computer vision papers. Writing tips are welcome.