Paper Title

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation

Paper Authors

Tom O'Malley, Arun Narayanan, Quan Wang

Paper Abstract

Recent work has shown that it is possible to train a single model to perform joint acoustic echo cancellation (AEC), speech enhancement, and voice separation, thereby serving as a unified frontend for robust automatic speech recognition (ASR). The joint model uses contextual information, such as a reference of the playback audio, noise context, and speaker embedding. In this work, we propose a number of novel improvements to such a model. First, we improve the architecture of the Cross-Attention Conformer that is used to ingest noise context into the model. Second, we generalize the model to be able to handle varying lengths of noise context. Third, we propose Signal Dropout, a novel strategy that models missing contextual information. In the absence of one or more signals, the proposed model performs nearly as well as task-specific models trained without these signals; and when such signals are present, our system compares well against systems that require all context signals. Over the baseline, the final model retains a relative word error rate reduction of 25.0% on background speech when speaker embedding is absent, and 61.2% on AEC when device playback is absent.
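
As a rough illustration of the Signal Dropout idea named in the abstract, the sketch below randomly replaces each optional context signal (playback reference, noise context, speaker embedding) with a placeholder during training, so the model learns to cope when that signal is missing at inference time. This is a minimal sketch under stated assumptions: the function name, drop probabilities, feature shapes, and the zero-vector placeholder are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

RNG = np.random.default_rng(0)

def signal_dropout(contexts: dict, drop_probs: dict, training: bool = True) -> dict:
    """Hypothetical sketch of Signal Dropout: during training, each optional
    context signal is independently blanked out with some probability,
    simulating its absence at inference time.

    contexts:   name -> feature array (e.g. 'playback', 'noise', 'speaker')
    drop_probs: name -> probability of dropping that signal this step
    """
    if not training:
        return contexts  # at inference, use whatever signals are available
    out = {}
    for name, feat in contexts.items():
        if RNG.random() < drop_probs.get(name, 0.0):
            # Zero placeholder stands in for a missing signal (an assumption;
            # the paper may use a different placeholder).
            out[name] = np.zeros_like(feat)
        else:
            out[name] = feat
    return out

# Example: one training step where each signal may be independently dropped.
contexts = {
    "playback": RNG.standard_normal((100, 128)),  # AEC playback-reference features
    "noise":    RNG.standard_normal((600, 128)),  # noise-context features
    "speaker":  RNG.standard_normal((256,)),      # speaker embedding
}
dropped = signal_dropout(contexts, {"playback": 0.3, "noise": 0.3, "speaker": 0.3})
```

Training on such randomly-degraded inputs is what lets a single joint frontend degrade gracefully toward task-specific behavior when, say, no device playback or no speaker embedding is available.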
