Paper title
Multi-task self-supervised learning for Robust Speech Recognition
Paper authors
Paper abstract
Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE) that combines a convolutional encoder with multiple neural networks, called workers, tasked with solving self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE and common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.
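To make the idea of an online speech distortion module more concrete, the following is a minimal sketch of contaminating a waveform on the fly with randomly chosen disturbances. It is not the authors' implementation: the abstract does not list the disturbances, so the three used here (synthetic reverberation, additive white noise at a random SNR, and amplitude clipping), the function names, the probability `p`, and all numeric ranges are illustrative assumptions.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to reach the requested SNR in dB."""
    power = sum(x * x for x in signal) / len(signal)
    sigma = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, sigma) for x in signal]


def add_reverb(signal, rng, taps=64, decay=0.1):
    """Convolve with a synthetic, exponentially decaying random impulse response.

    The impulse response is an assumption made for illustration; a real
    system would more likely use measured room impulse responses.
    """
    ir = [1.0] + [0.3 * rng.gauss(0.0, 1.0) * math.exp(-decay * n)
                  for n in range(1, taps)]
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(taps, n + 1)):
            acc += ir[k] * signal[n - k]
        out.append(acc)
    return out  # same length as the input; the reverberant tail is truncated


def clip(signal, rng):
    """Clip amplitudes at a random fraction of the peak value."""
    peak = max(abs(x) for x in signal)
    threshold = peak * rng.uniform(0.3, 0.9)
    return [max(-threshold, min(threshold, x)) for x in signal]


def distort(signal, rng, p=0.5):
    """Apply each disturbance independently with probability p (hypothetical)."""
    out = list(signal)
    if rng.random() < p:
        out = add_reverb(out, rng)
    if rng.random() < p:
        out = add_noise(out, rng.uniform(0.0, 20.0), rng)
    if rng.random() < p:
        out = clip(out, rng)
    return out


# Usage: each call draws a fresh random contamination of the same clean input.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
distorted = distort(clean, rng)
```

Because the contamination is drawn anew on every call, an encoder trained this way would see a different distorted version of each utterance at every epoch, which is the general motivation for applying distortions online rather than precomputing a fixed noisy dataset.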