Paper Title
Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition
Paper Authors
Paper Abstract
Self-supervised speech recognition models require considerable labeled training data for learning high-fidelity representations for Automatic Speech Recognition (ASR), which is computationally demanding and time-consuming. We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR. We discover that the dataset pruning strategies used in vision tasks for sampling the most informative examples do not perform better than random subset selection on fine-tuning self-supervised ASR. We then present the COWERAGE algorithm for representative subset selection in self-supervised ASR. COWERAGE is based on our finding that ensuring the coverage of examples based on training Word Error Rate (WER) in the early training epochs leads to better generalization performance. Extensive experiments with the wav2vec 2.0 and HuBERT models on TIMIT, LibriSpeech, and LJSpeech datasets show the effectiveness of COWERAGE and its transferability across models, with up to 17% relative WER improvement over existing dataset pruning methods and random sampling. We also demonstrate that the coverage of training instances in terms of WER values ensures the inclusion of phonemically diverse examples, leading to better test accuracy in self-supervised speech recognition models.
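The abstract does not spell out the selection procedure, but the coverage idea can be illustrated as stratified sampling over early-epoch training WER. The following is a minimal sketch under that assumption; the function and parameter names (`select_subset_by_wer_coverage`, `early_epoch_wer`, `budget`, `num_buckets`) are illustrative and not taken from the paper's implementation.

```python
import random
from typing import Dict, List


def select_subset_by_wer_coverage(
    early_epoch_wer: Dict[str, float],  # utterance id -> training WER at an early epoch
    budget: int,                        # number of examples to keep for fine-tuning
    num_buckets: int = 10,
    seed: int = 0,
) -> List[str]:
    """Pick a subset whose early-epoch WERs span the full WER range.

    Examples are sorted by WER, split into contiguous buckets, and an equal
    share of the budget is drawn from each bucket, so both easy (low-WER)
    and hard (high-WER) utterances are represented in the fine-tuning set.
    """
    rng = random.Random(seed)
    ordered = sorted(early_epoch_wer, key=early_epoch_wer.get)
    bucket_size = max(1, len(ordered) // num_buckets)
    buckets = [ordered[i:i + bucket_size] for i in range(0, len(ordered), bucket_size)]

    per_bucket = max(1, budget // len(buckets))
    selected: List[str] = []
    for bucket in buckets:
        selected.extend(rng.sample(bucket, min(per_bucket, len(bucket))))

    # Top up from the remaining pool (or trim) to hit the exact budget.
    if len(selected) < budget:
        remaining = [u for u in ordered if u not in set(selected)]
        selected.extend(rng.sample(remaining, budget - len(selected)))
    return selected[:budget]
```

In contrast, difficulty-based pruning would keep only the highest-WER bucket; the abstract's finding is that covering the whole WER range generalizes better than such single-ended selection or plain random sampling.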