Paper Title

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

Authors

Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, Mike Seltzer

Abstract

This paper proposes Emformer, an efficient memory transformer for low-latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention. A cache mechanism saves the key and value computations of self-attention for the left context. Emformer applies parallelized block processing in training to support low-latency models. We carry out experiments on the LibriSpeech benchmark. Under an average latency of 960 ms, Emformer achieves a WER of $2.50\%$ on test-clean and $5.62\%$ on test-other. Compared with a strong augmented memory transformer (AM-TRF) baseline, Emformer obtains a $4.6\times$ training speedup and an $18\%$ relative real-time factor (RTF) reduction in decoding, with relative WER reductions of $17\%$ on test-clean and $9\%$ on test-other. For a low-latency scenario with an average latency of 80 ms, Emformer achieves a WER of $3.01\%$ on test-clean and $7.09\%$ on test-other. Compared with an LSTM baseline of the same latency and model size, Emformer obtains relative WER reductions of $9\%$ and $16\%$ on test-clean and test-other, respectively.
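
The abstract's three ingredients (an augmented memory bank summarizing long-range history, cached key/value projections for the left context, and chunk-wise processing) can be illustrated with a small sketch. The code below is a simplified, hypothetical single-layer streaming step written in PyTorch; it is not the authors' implementation or any library API. The class and parameter names (`StreamingMemoryAttention`, `max_memory`, `left_context`) are invented for illustration, mean pooling stands in for the paper's memory-summarization details, and layer normalization, the feed-forward module, right-context handling, and the parallelized training path are all omitted.

```python
# Minimal, illustrative sketch of one Emformer-style streaming attention step.
# Assumptions are noted inline; this is a sketch, not the paper's implementation.
import torch
import torch.nn.functional as F


class StreamingMemoryAttention(torch.nn.Module):
    def __init__(self, d_model: int, max_memory: int, left_context: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.max_memory = max_memory      # size of the augmented memory bank
        self.left_context = left_context  # number of cached left-context frames
        # Streaming state: memory bank plus cached key/value of previous chunks.
        self.memory = torch.zeros(0, d_model)
        self.cached_k = torch.zeros(0, d_model)
        self.cached_v = torch.zeros(0, d_model)

    @torch.no_grad()
    def step(self, chunk: torch.Tensor) -> torch.Tensor:
        """chunk: (chunk_len, d_model) frames of the current segment."""
        q = self.q_proj(chunk)
        k_new, v_new = self.k_proj(chunk), self.v_proj(chunk)
        # Left-context keys/values come from the cache, so they are not
        # recomputed for every new chunk (the cache mechanism in the abstract).
        k = torch.cat([self.k_proj(self.memory), self.cached_k, k_new], dim=0)
        v = torch.cat([self.v_proj(self.memory), self.cached_v, v_new], dim=0)
        attn = F.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)
        out = attn @ v
        # Summarize the current chunk into one memory vector (mean pooling is a
        # stand-in here), keeping only the most recent `max_memory` entries.
        summary = chunk.mean(dim=0, keepdim=True)
        self.memory = torch.cat([self.memory, summary], dim=0)[-self.max_memory:]
        # Cache this chunk's key/value as left context for the next step.
        self.cached_k = torch.cat([self.cached_k, k_new], dim=0)[-self.left_context:]
        self.cached_v = torch.cat([self.cached_v, v_new], dim=0)[-self.left_context:]
        return out


# Hypothetical usage: feed short chunks one at a time (sizes chosen arbitrarily).
layer = StreamingMemoryAttention(d_model=512, max_memory=4, left_context=8)
for _ in range(10):
    y = layer.step(torch.randn(4, 512))   # (4, 512) output for each chunk
```

At streaming inference time only this per-chunk state needs to be carried along; during training, as the abstract notes, Emformer processes the blocks of an utterance in parallel rather than sequentially, which is where the reported training speedup over AM-TRF comes from.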
