Paper Title
Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition
Paper Authors
Paper Abstract
State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, performance on impaired speech remains an issue. The current study explores the usefulness of Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech. Dysarthric speech recognition is particularly difficult because several aspects of speech, such as articulation, prosody, and phonation, can be impaired. Specifically, we train an acoustic model with features extracted from Wav2Vec, HuBERT, and the cross-lingual XLSR model. Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance. In particular, features from the multilingual model led to lower WERs than filterbank (Fbank) features or models trained on a single language. Improvements were observed for English speakers with dysarthria caused by cerebral palsy (UASpeech corpus), Spanish speakers with Parkinsonian dysarthria (PC-GITA corpus), and Italian speakers with paralysis-based dysarthria (EasyCall corpus). Compared to using Fbank features, XLSR-based features reduced WERs by 6.8%, 22.0%, and 7.0% for the UASpeech, PC-GITA, and EasyCall corpora, respectively.
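The word error rate (WER) cited above is the standard ASR metric: the Levenshtein edit distance between the hypothesis and reference word sequences (substitutions, insertions, deletions), normalized by the reference length. A minimal sketch of how it is computed (the function name and examples are illustrative, not taken from the paper):

```python
# Minimal WER sketch: word-level Levenshtein distance via dynamic
# programming, normalized by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0 (exact match)
print(wer("the cat sat", "the bat sat"))  # one substitution in three words
```

Note that the abstract's reported reductions (e.g. 22.0% on PC-GITA) are differences in this metric between Fbank- and XLSR-based systems.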