Paper Title
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
Paper Authors
Paper Abstract
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task -- transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
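The abstract's pseudo-language idea can be made concrete with a small sketch: quantize frame-level speech features into discrete units, collapse consecutive repeats, and treat the resulting unit string as "text" on which a standard subword tokenizer (e.g. BPE) can be trained, yielding the pseudo subword targets for the decoder. The sketch below is an illustrative assumption about one plausible pipeline (function names, cluster count, and the character mapping are not taken from the paper).

```python
# Minimal sketch of deriving a pseudo-subword-ready target string from
# continuous speech features. All names and hyperparameters here are
# illustrative assumptions, not the paper's exact recipe.
import numpy as np
from sklearn.cluster import KMeans


def features_to_units(features: np.ndarray, n_clusters: int = 25, seed: int = 0) -> list[int]:
    """Quantize frame-level features (T x D) into discrete unit ids via k-means."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    return km.labels_.tolist()


def deduplicate(units: list[int]) -> list[int]:
    """Collapse consecutive repeated units to get a compact sequence."""
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


def units_to_text(units: list[int]) -> str:
    """Map unit ids to characters so a BPE tokenizer can be trained on the result."""
    return "".join(chr(ord("a") + u) for u in units)  # assumes n_clusters <= 26


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 16))  # stand-in for encoder features of one utterance
    pseudo_text = units_to_text(deduplicate(features_to_units(feats)))
    print(pseudo_text)  # "pseudo text" that a subword tokenizer would segment into decoder targets
```

In such a setup, the encoder-decoder model would then be pre-trained to transcribe the raw audio into the pseudo subword sequence, mirroring supervised ASR but without any human transcripts.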