Paper Title

VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

Authors

Junjie Li, Meng Ge, Zexu Pan, Longbiao Wang, Jianwu Dang

Abstract

Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.
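The abstract describes a two-stage dataflow: stage one pre-extracts the target speech from the mixture using visual (lip-movement) cues and estimates the underlying phonetic sequence; stage two refines that pre-extraction using the self-enrolled contextual cues. The sketch below illustrates only this dataflow with placeholder linear/nonlinear operations standing in for the actual extraction networks; all dimensions, weight matrices, and function names are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T audio frames, D feature size, V lip-embedding
# size, P phone classes. None of these come from the paper.
T, D, V, P = 100, 64, 32, 40

def stage1_visual_extract(mixture, lip_emb, W_v, W_p):
    """Stage 1: pre-extract target speech with visual cues and estimate
    the underlying phonetic sequence (the self-enrolled context)."""
    fused = mixture + lip_emb @ W_v          # fuse the visual cue into audio features
    pre_extracted = np.tanh(fused)           # placeholder for the extraction network
    phonetic_seq = pre_extracted @ W_p       # placeholder phonetic-sequence estimate
    return pre_extracted, phonetic_seq

def stage2_context_refine(pre_extracted, phonetic_seq, W_c):
    """Stage 2: refine the pre-extracted speech with contextual cues."""
    context = phonetic_seq @ W_c             # project context back to feature space
    return np.tanh(pre_extracted + context)  # placeholder refinement network

# Toy inputs and randomly initialized placeholder weights.
mixture = rng.standard_normal((T, D))
lip_emb = rng.standard_normal((T, V))
W_v = rng.standard_normal((V, D)) * 0.1
W_p = rng.standard_normal((D, P)) * 0.1
W_c = rng.standard_normal((P, D)) * 0.1

pre, phones = stage1_visual_extract(mixture, lip_emb, W_v, W_p)
refined = stage2_context_refine(pre, phones, W_c)
print(refined.shape)  # (100, 64)
```

The point of the staging is that the contextual cue is not supplied externally: it is estimated from the stage-one output ("self-enrolled") and then fed back to refine the extraction.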
