Paper Title
L-SpEx: Localized Target Speaker Extraction
Paper Authors
Abstract
Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from knowing the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or detected by an extra visual cue, e.g., a face image or video. In this paper, we propose an end-to-end localized target speaker extraction approach that relies on speech cues alone, called L-SpEx. Specifically, we design a speaker localizer, driven by the target speaker's embedding, to extract spatial features, including the direction of arrival (DOA) of the target speaker and the beamforming output. Then, the spatial cues and the target speaker's embedding are both used to form top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset MC-Libri2Mix show that L-SpEx significantly outperforms the baseline system.