Data augmentation using prosody and false starts to recognize non-native children's speech

Paper title

Data augmentation using prosody and false starts to recognize non-native children's speech

Authors

Kathania, Hemant; Singh, Mittul; Grósz, Tamás; Kurimo, Mikko

Abstract

This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children's speech. The task is to recognize non-native speech from children of various age groups, given only a limited amount of speech data. Moreover, because the speech is spontaneous, it contains false starts that are transcribed as partial words, so the test transcriptions contain partial words that have not been seen before. To cope with these two challenges, we investigate a data augmentation-based approach. First, we apply prosody-based data augmentation to supplement the audio data. Second, we simulate false starts by introducing partial-word noise into the language modeling corpora, which creates new words. Acoustic models trained on prosody-based augmented data outperform models that use the baseline recipe or SpecAugment-based augmentation. The partial-word noise also helps to improve the baseline language model. Our ASR system, a combination of these schemes, placed third in the evaluation period and achieved a word error rate of 18.71%. After the evaluation period, we observed that increasing the amount of prosody-based augmented data leads to better performance. Furthermore, removing words with low confidence scores from the hypotheses can lead to further gains. These two improvements lower the ASR error rate to 17.99%.
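The two text-side ideas in the abstract, partial-word noise for the language modeling corpora and removal of low-confidence words from hypotheses, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the trailing `-` marker for partial words, the insertion probability, the minimum word length, and the confidence threshold are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import random


def add_partial_word_noise(sentence, prob=0.1, min_len=4, marker="-", rng=None):
    """Simulate false starts in a text corpus line (hypothetical scheme).

    Before each sufficiently long word, with probability `prob`, insert a
    truncated copy of that word followed by `marker`, e.g. "pla- played".
    The marker and the insert-before-the-word policy are assumptions.
    """
    rng = rng or random.Random()
    noisy = []
    for word in sentence.split():
        if len(word) >= min_len and rng.random() < prob:
            # Cut point in 1..len(word)-1, so the prefix is non-empty
            # and always shorter than the full word.
            cut = rng.randint(1, len(word) - 1)
            noisy.append(word[:cut] + marker)
        noisy.append(word)
    return " ".join(noisy)


def drop_low_confidence(hypothesis, threshold=0.5):
    """Keep only words whose confidence score reaches `threshold`.

    `hypothesis` is a list of (word, confidence) pairs; the threshold
    value is an assumption, not a figure from the paper.
    """
    return [word for word, conf in hypothesis if conf >= threshold]


if __name__ == "__main__":
    print(add_partial_word_noise("the children played outside today",
                                 prob=0.5, rng=random.Random(1)))
    print(drop_low_confidence([("hello", 0.9), ("uh", 0.2), ("world", 0.75)]))
```

In a pipeline like the one described, the noisy sentences would be added to the language model training text so that partial words become part of the vocabulary, while the confidence filter would be applied to decoder output before scoring.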
