Cross-domain Voice Activity Detection with Self-Supervised Representations

Paper Title

Cross-domain Voice Activity Detection with Self-Supervised Representations

Authors

Sina Alisamir, Fabien Ringeval, Francois Portet

Abstract

Voice Activity Detection (VAD) aims to detect speech segments in an audio signal, a necessary first step for many of today's speech-based applications. Current state-of-the-art methods focus on training a neural network that exploits features taken directly from the acoustics, such as Mel Filter Banks (MFBs). Such methods therefore require an extra normalisation step to adapt to a new domain in which the acoustics are affected, which can be due simply to a change of speaker, microphone, or environment. In addition, this normalisation step is usually a rather rudimentary method with certain limitations, such as being highly sensitive to the amount of data available for the new domain. Here, we exploited the crowd-sourced Common Voice (CV) corpus to show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains, because they are computed with contextualised representations of speech across multiple domains. SSL representations also achieve better results than systems based on hand-crafted representations (MFBs) and off-the-shelf VADs, with significant improvement in cross-domain settings.
