Paper Title

Improving Mandarin Speech Recognition with Block-augmented Transformer

Authors

Xiaoming Ren, Huifeng Zhu, Liuwei Wei, Minghui Wu, Jie Hao

Abstract

Recently, the Convolution-augmented Transformer (Conformer) has shown promising results in Automatic Speech Recognition (ASR), outperforming the previously best published Transformer Transducer. In this work, we argue that the output information of each block in the encoder and decoder is not completely inclusive; in other words, the blocks' outputs may be complementary. We study how to exploit the complementary information of each block in a parameter-efficient way, which can be expected to yield more robust performance. We therefore propose the Block-augmented Transformer for speech recognition, named Blockformer. We implement two block-ensemble methods: the base Weighted Sum of the Blocks Output (Base-WSBO), and the Squeeze-and-Excitation module applied to the Weighted Sum of the Blocks Output (SE-WSBO). Experiments show that the Blockformer significantly outperforms state-of-the-art Conformer-based models on AISHELL-1: our model achieves a CER of 4.29% on the test set without a language model and 4.05% with an external language model.
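To make the two block-ensemble ideas concrete, here is a minimal numpy sketch of what a weighted sum over block outputs could look like. This is an illustration under assumptions, not the paper's implementation: the function names `base_wsbo` and `se_wsbo`, the tensor shapes, and the exact excitation bottleneck (linear, ReLU, linear, softmax) are all choices made for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def base_wsbo(block_outputs, weights):
    # Base-WSBO: softmax-normalize one learnable scalar per block,
    # then take the weighted sum of all block outputs.
    w = softmax(weights)                     # (num_blocks,)
    stacked = np.stack(block_outputs)        # (num_blocks, T, d)
    return np.tensordot(w, stacked, axes=1)  # (T, d)

def se_wsbo(block_outputs, w1, b1, w2, b2):
    # SE-WSBO sketch: "squeeze" each block's output to a scalar via
    # global average pooling, then an "excitation" bottleneck
    # (linear -> ReLU -> linear -> softmax) produces the block weights.
    stacked = np.stack(block_outputs)             # (num_blocks, T, d)
    squeezed = stacked.mean(axis=(1, 2))          # (num_blocks,)
    hidden = np.maximum(0.0, squeezed @ w1 + b1)  # bottleneck of width r
    gates = softmax(hidden @ w2 + b2)             # (num_blocks,)
    return np.tensordot(gates, stacked, axes=1)   # (T, d)

# Toy check: 3 blocks, sequence length 4, model dim 2.
blocks = [np.full((4, 2), float(i)) for i in (1, 2, 3)]
out = base_wsbo(blocks, np.zeros(3))  # equal weights -> plain average
```

With zero-initialized weights the softmax is uniform, so the combined output is simply the average of the block outputs; training would then learn to emphasize the more informative blocks.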
