Paper Title
Unsupervised Pre-training of Bidirectional Speech Encoders via Masked Reconstruction
Paper Authors
Paper Abstract
We propose an approach for pre-training speech representations via a masked reconstruction loss. Our pre-trained encoder networks are bidirectional and can therefore be used directly in typical bidirectional speech recognition models. The pre-trained networks can then be fine-tuned on a smaller amount of supervised data for speech recognition. Experiments with this approach on the LibriSpeech and Wall Street Journal corpora show promising results. We find that the main factors that lead to speech recognition improvements are: masking segments of sufficient width in both time and frequency, pre-training on a much larger amount of unlabeled data than the labeled data, and domain adaptation when the unlabeled and labeled data come from different domains. The gain from pre-training is additive to that of supervised data augmentation.
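The core idea of masking "segments of sufficient width in both time and frequency" and training on a reconstruction loss can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the mask counts and widths (`num_time_masks`, `max_time_width`, etc.) are hypothetical parameter names, and the loss is an ordinary mean squared error restricted to the masked entries.

```python
import numpy as np

def mask_spectrogram(spec, num_time_masks=2, num_freq_masks=2,
                     max_time_width=20, max_freq_width=8, rng=None):
    """Zero out random time bands and frequency bands of a (T, F)
    spectrogram; return the masked input and a boolean mask marking
    which entries were hidden."""
    rng = rng or np.random.default_rng()
    T, F = spec.shape
    mask = np.zeros((T, F), dtype=bool)
    for _ in range(num_time_masks):          # contiguous blocks along time
        w = int(rng.integers(1, max_time_width + 1))
        t0 = int(rng.integers(0, max(T - w, 1)))
        mask[t0:t0 + w, :] = True
    for _ in range(num_freq_masks):          # contiguous blocks along frequency
        w = int(rng.integers(1, max_freq_width + 1))
        f0 = int(rng.integers(0, max(F - w, 1)))
        mask[:, f0:f0 + w] = True
    masked = spec.copy()
    masked[mask] = 0.0
    return masked, mask

def masked_reconstruction_loss(pred, target, mask):
    """MSE computed only over the masked positions, so the model is
    trained to reconstruct what it could not see."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))
```

During pre-training, the encoder receives the masked spectrogram and is optimized to reconstruct the original values at the masked positions; an unmasked bidirectional encoder can then be fine-tuned on labeled speech.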