Paper Title

Self-Supervised Learning of Audio-Visual Objects from Video

Paper Authors

Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

Paper Abstract

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
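The abstract describes attention being used to localize sound sources in the video frame. As a rough, self-contained illustration of that general idea (not the authors' actual model), the NumPy sketch below scores each spatial location of a visual feature grid against an audio embedding via cosine similarity and picks the peak response; the function names, grid size, and scoring formulation are assumptions for illustration only. The paper's full model additionally groups sources and aggregates information over time using optical flow, which is omitted here.

    import numpy as np

    def audio_visual_attention(visual_feats, audio_feat):
        """Per-location correspondence map between a spatial grid of
        L2-normalized visual embeddings (H, W, D) and one L2-normalized
        audio embedding (D,). Returns an (H, W) cosine-similarity map."""
        return visual_feats @ audio_feat  # dot product == cosine for unit vectors

    def localize_sound_source(att_map):
        """Return the (row, col) of the strongest audio-visual response."""
        return np.unravel_index(np.argmax(att_map), att_map.shape)

    # Toy example: random unit-norm features on a 14x14 grid.
    rng = np.random.default_rng(0)
    H, W, D = 14, 14, 128
    visual = rng.normal(size=(H, W, D))
    visual /= np.linalg.norm(visual, axis=-1, keepdims=True)
    audio = rng.normal(size=D)
    audio /= np.linalg.norm(audio)

    att = audio_visual_attention(visual, audio)
    print("Sound source localized at:", localize_sound_source(att))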
