Paper Title

Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

Paper Authors

Keqi Deng, Songjun Cao, Yike Zhang, Long Ma, Gaofeng Cheng, Ji Xu, Pengyuan Zhang

Abstract

Recently, end-to-end automatic speech recognition models based on connectionist temporal classification (CTC) have achieved impressive results, especially when fine-tuned from wav2vec2.0 models. Due to the conditional independence assumption, CTC-based models are always weaker than attention-based encoder-decoder models and require the assistance of external language models (LMs). To solve this issue, we propose two knowledge transferring methods that leverage pre-trained LMs, such as BERT and GPT2, to improve CTC-based models. The first method is based on representation learning, in which the CTC-based models use the representation produced by BERT as an auxiliary learning target. The second method is based on joint classification learning, which combines GPT2 for text modeling with a hybrid CTC/attention architecture. Experiments on the AISHELL-1 corpus yield a character error rate (CER) of 4.2% on the test set. When compared to the vanilla CTC-based models fine-tuned from the wav2vec2.0 models, our knowledge transferring method reduces CER by 16.1% relatively without external LMs.
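
The first method described above trains the CTC model with an auxiliary target derived from BERT representations of the transcript. The following is a minimal PyTorch-style sketch of that general idea, not the authors' implementation: the stand-in LSTM acoustic encoder (the paper fine-tunes wav2vec2.0), the mean-pooled utterance embedding matched to BERT's [CLS] vector with an MSE loss, and the names `CTCWithBERTTarget`, `aux_weight`, and the `bert-base-chinese` checkpoint are all illustrative assumptions.

```python
# Hypothetical sketch: CTC training with an auxiliary loss that pulls an
# acoustic utterance embedding toward a frozen BERT representation of the
# transcript. Illustrative only; it is not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel


class CTCWithBERTTarget(nn.Module):
    def __init__(self, vocab_size, feat_dim=80, hidden=512,
                 bert_name="bert-base-chinese"):
        super().__init__()
        # Stand-in acoustic encoder; the paper fine-tunes wav2vec2.0 instead.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.ctc_head = nn.Linear(2 * hidden, vocab_size)   # frame-level CTC logits
        self.bert = BertModel.from_pretrained(bert_name)     # frozen teacher LM
        for p in self.bert.parameters():
            p.requires_grad = False
        # Project acoustic embeddings into BERT's hidden space.
        self.proj = nn.Linear(2 * hidden, self.bert.config.hidden_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, targets, target_lens,
                bert_inputs, aux_weight=0.1):
        enc, _ = self.encoder(feats)                         # (B, T, 2H)
        log_probs = self.ctc_head(enc).log_softmax(-1)       # (B, T, V)
        loss_ctc = self.ctc_loss(log_probs.transpose(0, 1),  # CTC wants (T, B, V)
                                 targets, feat_lens, target_lens)

        # Auxiliary target: BERT's [CLS] representation of the transcript.
        with torch.no_grad():
            bert_repr = self.bert(**bert_inputs).last_hidden_state[:, 0]  # (B, D)
        acoustic_repr = self.proj(enc.mean(dim=1))           # mean-pooled utterance embedding
        loss_aux = F.mse_loss(acoustic_repr, bert_repr)

        # Auxiliary term regularizes the encoder with LM knowledge;
        # aux_weight is a hypothetical tuning knob.
        return loss_ctc + aux_weight * loss_aux
```

At inference time only the CTC branch is used, so the BERT teacher adds no decoding cost, which is consistent with the abstract's claim of improving CTC models without an external LM at test time.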
