Paper Title

Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition

Authors

Hu Hu, Rui Zhao, Jinyu Li, Liang Lu, Yifan Gong

Abstract

Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research due to its advantage of being capable of online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we instead leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on 65,000 hours of Microsoft anonymized production data with personally identifiable information removed, our proposed methods obtain significant improvements. In particular, the encoder pre-training solution achieves a 10% and an 8% relative word error rate reduction compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency compared with the baseline.
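
To make the encoder pre-training idea concrete, below is a minimal PyTorch sketch, not the authors' code: the encoder is first trained as a frame-level classifier with cross-entropy against external alignment labels, and its weights are then copied into the RNN-T encoder before standard RNN-T training. All module names, dimensions, and target counts here are hypothetical placeholders.

```python
# Minimal sketch (illustrative only, not the paper's implementation):
# Stage 1 pre-trains the encoder on frame-level alignment labels;
# Stage 2 reuses those weights to initialize the RNN-T encoder.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Streaming-style LSTM acoustic encoder shared by both stages."""
    def __init__(self, feat_dim=80, hidden=512, layers=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)

    def forward(self, x):            # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)
        return out                   # (batch, time, hidden)

# Stage 1: frame-level cross-entropy pre-training against external
# alignments (e.g. senone/phone labels from a conventional hybrid model).
feat_dim, hidden, num_targets = 80, 512, 100   # hypothetical sizes
encoder = Encoder(feat_dim, hidden)
ce_head = nn.Linear(hidden, num_targets)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(ce_head.parameters()), lr=1e-4)

feats = torch.randn(8, 200, feat_dim)             # dummy acoustic features
align = torch.randint(0, num_targets, (8, 200))   # dummy frame-level alignment labels

logits = ce_head(encoder(feats))                  # (batch, time, num_targets)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_targets), align.reshape(-1))
loss.backward()
opt.step()

# Stage 2: initialize the RNN-T encoder from the pre-trained weights,
# then continue with full RNN-T training (prediction and joint networks
# randomly initialized; the RNN-T loss itself is not shown here).
rnnt_encoder = Encoder(feat_dim, hidden)
rnnt_encoder.load_state_dict(encoder.state_dict())
```

Whole-network pre-training follows the same spirit but uses the alignments to construct supervision for the entire RNN-T network rather than the encoder alone; the exact formulation is described in the paper.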
