Paper Title
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
Paper Authors
Paper Abstract
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
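The abstract's pseudo-language idea can be made concrete with a small sketch: quantize frame-level speech features into discrete units, collapse consecutive repeats, and treat the resulting unit string as "text" on which a standard subword tokenizer (e.g. BPE) can be trained, yielding the pseudo subword targets for the decoder. The sketch below is an illustrative assumption about one plausible pipeline (function names, cluster count, and the character mapping are not taken from the paper).

```python
# Minimal sketch of deriving a pseudo-subword-ready target string from
# continuous speech features. All names and hyperparameters here are
# illustrative assumptions, not the paper's exact recipe.
import numpy as np
from sklearn.cluster import KMeans


def features_to_units(features: np.ndarray, n_clusters: int = 25, seed: int = 0) -> list[int]:
    """Quantize frame-level features (T x D) into discrete unit ids via k-means."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    return km.labels_.tolist()


def deduplicate(units: list[int]) -> list[int]:
    """Collapse consecutive repeated units to get a compact sequence."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


def units_to_text(units: list[int]) -> str:
    """Map unit ids to characters so a BPE tokenizer can be trained on the result."""
    return "".join(chr(ord("a") + u) for u in units)  # assumes n_clusters <= 26


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 16))  # stand-in for encoder features of one utterance
    pseudo_text = units_to_text(deduplicate(features_to_units(feats)))
    print(pseudo_text)  # "pseudo text" that a subword tokenizer would segment into decoder targets
```

In such a setup, the encoder-decoder model would then be pre-trained to transcribe the raw audio into the pseudo subword sequence, mirroring supervised ASR but without any human transcripts.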