Paper Title

Laugh Betrays You? Learning Robust Speaker Representation From Speech Containing Non-Verbal Fragments

Authors

Yuke Lin, Xiaoyi Qin, Huahua Cui, Zhenyi Zhu, Ming Li

Abstract

The success of automatic speaker verification shows that discriminative speaker representations can be extracted from neutral speech. However, as a kind of non-verbal voice, laughter should also carry speaker information intuitively. Thus, this paper focuses on exploring speaker verification about utterances containing non-verbal laughter segments. We collect a set of clips with laughter components by conducting a laughter detection script on VoxCeleb and part of the CN-Celeb dataset. To further filter untrusted clips, probability scores are calculated by our binary laughter detection classifier, which is pre-trained by pure laughter and neutral speech. After that, based on the clips whose scores are over the threshold, we construct trials under two different evaluation scenarios: Laughter-Laughter (LL) and Speech-Laughter (SL). Then a novel method called Laughter-Splicing based Network (LSN) is proposed, which can significantly boost performance in both scenarios and maintain the performance on the neutral speech, such as the VoxCeleb1 test set. Specifically, our system achieves relative 20% and 22% improvement on Laughter-Laughter and Speech-Laughter trials, respectively. The meta-data and sample clips have been released at https://github.com/nevermoreLin/Laugh_LSN.
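The data pipeline described above (score clips with a laughter classifier, keep those above a threshold, then pair the survivors into Laughter-Laughter and Speech-Laughter trials) can be sketched as follows. This is a minimal illustration, not the authors' code: the field names (`laugh_prob`, `spk`), the 0.5 threshold, and the exhaustive pairing strategy are all assumptions for demonstration.

```python
# Hypothetical sketch of the clip-filtering and trial-construction steps.
# Field names, the 0.5 threshold, and exhaustive pairing are illustrative
# assumptions, not details taken from the paper.
from itertools import combinations


def filter_clips(clips, threshold=0.5):
    """Keep only clips whose laughter-classifier probability exceeds the threshold."""
    return [c for c in clips if c["laugh_prob"] > threshold]


def build_trials(laugh_clips, speech_clips):
    """Build Laughter-Laughter (LL) and Speech-Laughter (SL) trial pairs.

    Each trial is (clip_id_a, clip_id_b, is_same_speaker).
    """
    ll = [(a["id"], b["id"], a["spk"] == b["spk"])
          for a, b in combinations(laugh_clips, 2)]
    sl = [(s["id"], l["id"], s["spk"] == l["spk"])
          for s in speech_clips for l in laugh_clips]
    return ll, sl


clips = [
    {"id": "u1", "spk": "A", "laugh_prob": 0.9},
    {"id": "u2", "spk": "A", "laugh_prob": 0.3},  # below threshold, filtered out
    {"id": "u3", "spk": "B", "laugh_prob": 0.8},
]
speech = [{"id": "s1", "spk": "A"}, {"id": "s2", "spk": "B"}]

laugh = filter_clips(clips)
ll, sl = build_trials(laugh, speech)
print(len(laugh), len(ll), len(sl))  # 2 1 4
```

In an evaluation, each trial pair would then be scored by the speaker-embedding model, so the threshold filter directly controls how trustworthy the laughter-side enrollment and test segments are.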
