Speaker Extraction with Co-Speech Gestures Cue

Paper title

Speaker Extraction with Co-Speech Gestures Cue

Authors

Zexu Pan, Xinyuan Qian, Haizhou Li

Abstract

Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. Previous studies have used a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures, which are naturally timed with speech, also contribute to speech perception. In this work, we explore the use of co-speech gesture sequences, e.g. hand and body movements, as the speaker cue for speaker extraction. Such sequences can be easily obtained from low-resolution video recordings and are thus more widely available than face recordings. We propose two networks that use the co-speech gestures cue to perform attentive listening on the target speaker: one implicitly fuses the co-speech gestures cue in the speaker extraction process; the other performs speech separation first and then explicitly uses the co-speech gestures cue to associate a separated speech signal with the target speaker. The experimental results show that the co-speech gestures cue is informative for associating speech with the target speaker.

Paper title