Paper Title

Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

Paper Authors

Keqi Deng, Songjun Cao, Yike Zhang, Long Ma, Gaofeng Cheng, Ji Xu, Pengyuan Zhang

Abstract

Recently, end-to-end automatic speech recognition models based on connectionist temporal classification (CTC) have achieved impressive results, especially when fine-tuned from wav2vec2.0 models. Due to the conditional independence assumption, CTC-based models are always weaker than attention-based encoder-decoder models and require the assistance of external language models (LMs). To solve this issue, we propose two knowledge transferring methods that leverage pre-trained LMs, such as BERT and GPT2, to improve CTC-based models. The first method is based on representation learning, in which the CTC-based models use the representation produced by BERT as an auxiliary learning target. The second method is based on joint classification learning, which combines GPT2 for text modeling with a hybrid CTC/attention architecture. Experiments on the AISHELL-1 corpus yield a character error rate (CER) of 4.2% on the test set. When compared to the vanilla CTC-based models fine-tuned from the wav2vec2.0 models, our knowledge transferring method reduces CER by 16.1% relatively without external LMs.
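
The first method described above trains the CTC model with an auxiliary target derived from BERT representations of the transcript. The following is a minimal PyTorch-style sketch of that general idea, not the authors' implementation: the stand-in LSTM acoustic encoder (the paper fine-tunes wav2vec2.0), the mean-pooled utterance embedding matched to BERT's [CLS] vector with an MSE loss, and the names `CTCWithBERTTarget`, `aux_weight`, and the `bert-base-chinese` checkpoint are all illustrative assumptions.

```python
# Hypothetical sketch: CTC training with an auxiliary loss that pulls an
# acoustic utterance embedding toward a frozen BERT representation of the
# transcript. Illustrative only; it is not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel


class CTCWithBERTTarget(nn.Module):
    def __init__(self, vocab_size, feat_dim=80, hidden=512,
                 bert_name="bert-base-chinese"):
        super().__init__()
        # Stand-in acoustic encoder; the paper fine-tunes wav2vec2.0 instead.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab_size)   # frame-level CTC logits
        self.bert = BertModel.from_pretrained(bert_name)     # frozen teacher LM
        for p in self.bert.parameters():
            p.requires_grad = False
        # Project acoustic embeddings into BERT's hidden space.
        self.proj = nn.Linear(2 * hidden, self.bert.config.hidden_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, targets, target_lens,
                bert_inputs, aux_weight=0.1):
        enc, _ = self.encoder(feats)                         # (B, T, 2H)
        log_probs = self.ctc_head(enc).log_softmax(-1)       # (B, T, V)
        loss_ctc = self.ctc_loss(log_probs.transpose(0, 1),  # CTC wants (T, B, V)
                                 targets, feat_lens, target_lens)

        # Auxiliary target: BERT's [CLS] representation of the transcript.
        with torch.no_grad():
            bert_repr = self.bert(**bert_inputs).last_hidden_state[:, 0]  # (B, D)
        acoustic_repr = self.proj(enc.mean(dim=1))           # mean-pooled utterance embedding
        loss_aux = F.mse_loss(acoustic_repr, bert_repr)

        # Auxiliary term regularizes the encoder with LM knowledge;
        # aux_weight is a hypothetical tuning knob.
        return loss_ctc + aux_weight * loss_aux
```

At inference time only the CTC branch is used, so the BERT teacher adds no decoding cost, which is consistent with the abstract's claim of improving CTC models without an external LM at test time.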
