Paper Title

Generating Visually Aligned Sound from Videos

Paper Authors

Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, Chuang Gan

Paper Abstract

We focus on the task of generating sound from natural videos, where the sound should be both temporally and content-wise aligned with the visual signals. This task is extremely challenging because some sounds are generated \emph{outside} the camera's view and cannot be inferred from the video content, so the model may be forced to learn an incorrect mapping between visual content and these irrelevant sounds. To address this challenge, we propose a framework named REGNET. In this framework, we first extract appearance and motion features from video frames to better distinguish the sound-emitting object from complex background information. We then introduce an innovative audio forwarding regularizer that takes the real sound as input and outputs bottlenecked sound features. During training, both the visual and the bottlenecked sound features are used for sound prediction, which provides stronger supervision. The audio forwarding regularizer can control the irrelevant sound component and thus prevents the model from learning an incorrect mapping between video frames and sound emitted by off-screen objects. During testing, the audio forwarding regularizer is removed to ensure that REGNET produces purely aligned sound from visual features alone. Extensive evaluations based on Amazon Mechanical Turk demonstrate that our method significantly improves both temporal and content-wise alignment. Remarkably, our generated sound can fool humans with a 68.12% success rate. Code and pre-trained models are publicly available at https://github.com/PeihaoChen/regnet.
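The abstract's key mechanism, feeding a bottlenecked version of the real sound alongside visual features during training and removing it at test time, can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch rendering of that train/test asymmetry; all module names, feature dimensions, and the zero-filled test-time bottleneck are illustrative assumptions, not the authors' implementation (see the linked repository for that).

import torch
import torch.nn as nn

class RegNetSketch(nn.Module):
    """Hypothetical sketch of the train/test asymmetry described in the abstract."""

    def __init__(self, visual_dim=512, audio_dim=80, bottleneck_dim=16):
        super().__init__()
        self.bottleneck_dim = bottleneck_dim
        # Encodes per-frame appearance + motion features from the video.
        self.visual_encoder = nn.GRU(visual_dim, 256, batch_first=True)
        # Audio forwarding regularizer: squeezes the real sound through a
        # narrow bottleneck so it can only carry the component the visuals
        # cannot explain (dimensions are assumptions, not the paper's).
        self.audio_forwarding = nn.Sequential(
            nn.Linear(audio_dim, bottleneck_dim),
            nn.ReLU(),
        )
        # Predicts spectrogram frames from visual + bottleneck features.
        self.decoder = nn.Linear(256 + bottleneck_dim, audio_dim)

    def forward(self, visual_feats, real_sound=None):
        h, _ = self.visual_encoder(visual_feats)  # (B, T, 256)
        if real_sound is not None:
            # Training: bottlenecked real sound provides extra supervision.
            b = self.audio_forwarding(real_sound)  # (B, T, bottleneck_dim)
        else:
            # Testing: the regularizer is removed; zeros stand in for its
            # output so generation depends on visual features alone.
            b = torch.zeros(h.size(0), h.size(1), self.bottleneck_dim,
                            device=h.device)
        return self.decoder(torch.cat([h, b], dim=-1))

model = RegNetSketch()
video = torch.randn(2, 100, 512)   # 2 clips, 100 frames of visual features
sound = torch.randn(2, 100, 80)    # aligned mel-spectrogram frames
train_pred = model(video, sound)   # training path: visual + bottlenecked sound
test_pred = model(video)           # test path: visual features only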
