Paper title
Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR
Paper authors
Paper abstract
Speech enhancement (SE) is usually required as a front end to improve speech quality in noisy environments, but the enhanced speech may not be optimal for automatic speech recognition (ASR) systems because SE introduces speech distortion. On the other hand, self-supervised pre-training has been shown to make use of large amounts of unlabeled noisy data, which benefits the noise robustness of ASR. However, how best to integrate SE with self-supervised pre-training remains unclear. To find an appropriate combination and to reduce the impact of the speech distortion caused by SE, we propose a joint pre-training approach for the SE module and the self-supervised model. First, in the pre-training phase, either the original noisy waveform or the waveform obtained by SE is fed into the self-supervised model to learn contextual representations, with the quantized clean speech acting as the target. Second, we propose a dual-attention fusion method that fuses the features of the noisy and the enhanced speech, which can compensate for the information loss incurred when either module is used on its own. Because the clean, noisy, and enhanced branches can be exploited flexibly, the proposed method turns out to be a generalization of some existing noise-robust ASR models, e.g., enhanced wav2vec2.0. Finally, experimental results on both synthetic and real noisy datasets show that the proposed joint training approach improves ASR performance under various noisy settings, leading to stronger noise robustness.
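The abstract does not spell out how the dual-attention fusion is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the noisy and enhanced branches each produce a frame-level feature matrix of the same shape, that each branch attends to the other with plain scaled dot-product attention, and that the two attended views are averaged. The function names (cross_attention, dual_attention_fusion), the residual connections, and the averaging step are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Scaled dot-product attention: each query frame attends over all context frames.

    query, context: arrays of shape (T, d). Returns an array of shape (T, d).
    No learned projections are used here; a real model would have them.
    """
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)  # (T, T) frame-to-frame similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ context


def dual_attention_fusion(noisy_feats, enhanced_feats):
    """Hypothetical fusion: attend in both directions, add residuals, then average."""
    # Noisy frames pick up information from the enhanced branch.
    noisy_view = noisy_feats + cross_attention(noisy_feats, enhanced_feats)
    # Enhanced frames pick up information from the noisy branch.
    enhanced_view = enhanced_feats + cross_attention(enhanced_feats, noisy_feats)
    return 0.5 * (noisy_view + enhanced_view)


# Random stand-ins for the two feature streams (50 frames, 16 dimensions).
rng = np.random.default_rng(0)
noisy = rng.standard_normal((50, 16))
enhanced = rng.standard_normal((50, 16))
fused = dual_attention_fusion(noisy, enhanced)
print(fused.shape)  # (50, 16)
```

The point of attending in both directions is that the enhanced branch may have lost cues to distortion that the noisy branch still carries, and vice versa, so the fused representation can draw on both. In a trained system the attention would use learned query, key, and value projections and the fused features would feed the contextual encoder; those parts are omitted here.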