Paper Title

Improving Streaming Automatic Speech Recognition With Non-Streaming Model Distillation On Unsupervised Data

Authors

Thibault Doutre, Wei Han, Min Ma, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Arun Narayanan, Ananya Misra, Yu Zhang, Liangliang Cao

Abstract

Streaming end-to-end automatic speech recognition (ASR) models are widely used on smart speakers and on-device applications. Since these models are expected to transcribe speech with minimal latency, they are constrained to be causal with no future context, compared to their non-streaming counterparts. Consequently, streaming models usually perform worse than non-streaming models. We propose a novel and effective learning method by leveraging a non-streaming ASR model as a teacher to generate transcripts on an arbitrarily large data set, which is then used to distill knowledge into streaming ASR models. This way, we scale the training of streaming models to up to 3 million hours of YouTube audio. Experiments show that our approach can significantly reduce the word error rate (WER) of RNNT models not only on LibriSpeech but also on YouTube data in four languages. For example, in French, we are able to reduce the WER by 16.4% relative to a baseline streaming model by leveraging a non-streaming teacher model trained on the same amount of labeled data as the baseline.
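
The recipe described in the abstract (a full-context teacher transcribes unlabeled audio, and a causal streaming student is trained on those pseudo-labels) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the toy bidirectional/unidirectional LSTM encoders, the frame-level cross-entropy objective (standing in for the RNN-T loss used in the paper), and the random tensors standing in for unlabeled YouTube audio are all assumptions made for the example.

```python
# Minimal sketch of non-streaming -> streaming pseudo-label distillation.
# Toy sizes and a cross-entropy stand-in for the RNN-T loss; illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, FEAT_DIM, HIDDEN = 32, 80, 64  # toy vocabulary and feature sizes

class Encoder(nn.Module):
    def __init__(self, bidirectional: bool):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True,
                            bidirectional=bidirectional)
        self.proj = nn.Linear(HIDDEN * (2 if bidirectional else 1), VOCAB)

    def forward(self, feats):                 # feats: [batch, time, FEAT_DIM]
        out, _ = self.lstm(feats)
        return self.proj(out)                 # logits: [batch, time, VOCAB]

teacher = Encoder(bidirectional=True)    # non-streaming: sees full future context
student = Encoder(bidirectional=False)   # streaming: causal, no future context
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(10):
    unlabeled_audio = torch.randn(8, 50, FEAT_DIM)  # stand-in for unlabeled audio

    # 1) Teacher transcribes the unlabeled audio to produce pseudo-labels.
    with torch.no_grad():
        pseudo_labels = teacher(unlabeled_audio).argmax(dim=-1)  # [batch, time]

    # 2) Student is trained on the pseudo-labels as if they were references.
    logits = student(unlabeled_audio)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), pseudo_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```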
