End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Paper Title

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming

Authors

Wangyou Zhang, Aswin Shanmugam Subramanian, Xuankai Chang, Shinji Watanabe, Yanmin Qian

Abstract

Despite successful applications of end-to-end approaches in multi-channel speech recognition, performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate a dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multi-source mask-based weighted prediction error (WPE) module is incorporated in the frontend for dereverberation. Second, another novel frontend architecture is proposed, which extends the weighted power minimization distortionless response (WPD) convolutional beamformer to perform simultaneous separation and dereverberation. We derive a new formulation from the original WPD that can handle multi-source input, and replace eigenvalue decomposition with the matrix inverse operation to make the back-propagation algorithm more stable. Both architectures are optimized in a fully end-to-end manner, using only the speech recognition criterion. Experiments on both the spatialized wsj1-2mix corpus and REVERB show that our proposed model outperforms the conventional methods in reverberant scenarios.
