Paper Title

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

Authors

Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, Mike Seltzer

Abstract

This paper proposes Emformer, an efficient memory transformer for low-latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention. A cache mechanism saves the key and value computations of self-attention for the left context. Emformer applies parallelized block processing in training to support low-latency models. We carry out experiments on the LibriSpeech benchmark. Under an average latency of 960 ms, Emformer achieves a WER of $2.50\%$ on test-clean and $5.62\%$ on test-other. Compared with a strong augmented memory transformer (AM-TRF) baseline, Emformer obtains a $4.6\times$ training speedup and an $18\%$ relative real-time factor (RTF) reduction in decoding, with relative WER reductions of $17\%$ on test-clean and $9\%$ on test-other. For a low-latency scenario with an average latency of 80 ms, Emformer achieves a WER of $3.01\%$ on test-clean and $7.09\%$ on test-other. Compared with an LSTM baseline of the same latency and model size, Emformer obtains relative WER reductions of $9\%$ and $16\%$ on test-clean and test-other, respectively.
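
The abstract's three ingredients (an augmented memory bank summarizing long-range history, cached key/value projections for the left context, and chunk-wise processing) can be illustrated with a small sketch. The code below is a simplified, hypothetical single-layer streaming step written in PyTorch; it is not the authors' implementation or any library API. The class and parameter names (`StreamingMemoryAttention`, `max_memory`, `left_context`) are invented for illustration, mean pooling stands in for the paper's memory-summarization details, and layer normalization, the feed-forward module, right-context handling, and the parallelized training path are all omitted.

```python
# Minimal, illustrative sketch of one Emformer-style streaming attention step.
# Assumptions are noted inline; this is a sketch, not the paper's implementation.
import torch
import torch.nn.functional as F


class StreamingMemoryAttention(torch.nn.Module):
    def __init__(self, d_model: int, max_memory: int, left_context: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.max_memory = max_memory      # size of the augmented memory bank
        self.left_context = left_context  # number of cached left-context frames
        # Streaming state: memory bank plus cached key/value of previous chunks.
        self.memory = torch.zeros(0, d_model)
        self.cached_k = torch.zeros(0, d_model)
        self.cached_v = torch.zeros(0, d_model)

    @torch.no_grad()
    def step(self, chunk: torch.Tensor) -> torch.Tensor:
        """chunk: (chunk_len, d_model) frames of the current segment."""
        q = self.q_proj(chunk)
        k_new, v_new = self.k_proj(chunk), self.v_proj(chunk)
        # Left-context keys/values come from the cache, so they are not
        # recomputed for every new chunk (the cache mechanism in the abstract).
        k = torch.cat([self.k_proj(self.memory), self.cached_k, k_new], dim=0)
        v = torch.cat([self.v_proj(self.memory), self.cached_v, v_new], dim=0)
        attn = F.softmax(q @ k.t() / k.size(-1) ** 0.5, dim=-1)
        out = attn @ v
        # Summarize the current chunk into one memory vector (mean pooling is a
        # stand-in here), keeping only the most recent `max_memory` entries.
        summary = chunk.mean(dim=0, keepdim=True)
        self.memory = torch.cat([self.memory, summary], dim=0)[-self.max_memory:]
        # Cache this chunk's key/value as left context for the next step.
        self.cached_k = torch.cat([self.cached_k, k_new], dim=0)[-self.left_context:]
        self.cached_v = torch.cat([self.cached_v, v_new], dim=0)[-self.left_context:]
        return out


# Hypothetical usage: feed short chunks one at a time (sizes chosen arbitrarily).
layer = StreamingMemoryAttention(d_model=512, max_memory=4, left_context=8)
for _ in range(10):
    y = layer.step(torch.randn(4, 512))   # (4, 512) output for each chunk
```

At streaming inference time only this per-chunk state needs to be carried along; during training, as the abstract notes, Emformer processes the blocks of an utterance in parallel rather than sequentially, which is where the reported training speedup over AM-TRF comes from.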
