Paper Title
Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation
Paper Authors
Paper Abstract
Stereophonic audio is an indispensable ingredient for enhancing the human auditory experience. Recent research has explored using visual information as guidance to generate binaural or ambisonic audio from mono recordings under stereo supervision. However, this fully supervised paradigm suffers from an inherent drawback: recording stereophonic audio usually requires delicate devices that are too expensive for wide accessibility. To overcome this challenge, we propose to leverage the vastly available mono data to facilitate stereophonic audio generation. Our key observation is that the task of visually indicated audio separation also maps independent audio sources to their corresponding visual positions, which shares a similar objective with stereophonic audio generation. We integrate both stereo generation and source separation into a unified framework, Sep-Stereo, by regarding source separation as a particular type of audio spatialization. Specifically, a novel associative pyramid network architecture is carefully designed for audio-visual feature fusion. Extensive experiments demonstrate that our framework improves stereophonic audio generation while performing accurate sound separation with a shared backbone.
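To make the abstract's formulation concrete, below is a minimal PyTorch sketch of visually guided mono-to-stereo generation: visual features modulate audio spectrogram features, and the network predicts a mask on the mono spectrogram that yields the left/right channel difference. All module names, shapes, and the channel-modulation fusion are illustrative assumptions for this sketch, not the paper's exact associative pyramid architecture.

```python
# A hedged sketch of visually guided stereo generation, loosely in the
# spirit of Sep-Stereo. Every name, shape, and hyper-parameter here is an
# illustrative assumption, not the authors' published architecture.
import torch
import torch.nn as nn


class AssociativeFusion(nn.Module):
    """Associate visual features with audio channels (a rough stand-in
    for the paper's associative pyramid fusion)."""

    def __init__(self, audio_ch: int, visual_ch: int):
        super().__init__()
        # 1x1 conv projects visual features to per-position audio-channel weights.
        self.assoc = nn.Conv2d(visual_ch, audio_ch, kernel_size=1)

    def forward(self, audio_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, Ca, F, T) spectrogram features
        # visual_feat: (B, Cv, H, W) frame features
        weights = self.assoc(visual_feat)          # (B, Ca, H, W)
        weights = weights.mean(dim=(2, 3))         # pool over spatial positions -> (B, Ca)
        # Channel-wise modulation of audio features by the visual association.
        return audio_feat * weights[:, :, None, None]


class MonoToStereo(nn.Module):
    """Predict a mask over the mono spectrogram that produces the L-R
    channel-difference spectrogram (a common formulation in visually
    guided stereo generation; details here are assumptions)."""

    def __init__(self, visual_ch: int = 512):
        super().__init__()
        self.audio_enc = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),   # 2 = real/imag of the mono STFT
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.fusion = AssociativeFusion(audio_ch=64, visual_ch=visual_ch)
        self.mask_head = nn.Conv2d(64, 2, 3, padding=1)  # real/imag mask

    def forward(self, mono_spec: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        a = self.audio_enc(mono_spec)
        a = self.fusion(a, visual_feat)
        mask = self.mask_head(a)
        # Element-wise masking of the mono spectrogram gives the L-R
        # difference; a proper complex multiplication is elided for brevity.
        return mask * mono_spec


# Usage sketch: a batch of mono spectrograms plus visual feature maps
# (e.g., the last conv features of an image backbone).
model = MonoToStereo()
mono = torch.randn(4, 2, 257, 64)   # (B, real/imag, freq, time)
vis = torch.randn(4, 512, 7, 7)     # assumed visual feature map
diff = model(mono, vis)             # predicted L-R difference spectrogram
left = 0.5 * (mono + diff)          # recover channels, assuming mono = L + R
right = 0.5 * (mono - diff)
```

Under this formulation, source separation drops in naturally as a special case of spatialization: the same masking backbone that isolates the L-R difference can instead be asked to isolate one sound source's spectrogram given its visual region, which is how the shared-backbone training on mono data described in the abstract becomes possible.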