Paper Title
Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling
Paper Authors
Paper Abstract
Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensively investigated. This work focuses on improving the robustness of discrete input representations for generative spoken language modeling. First, we formally define how to measure the robustness of such representations to various signal variations that do not alter the spoken information (e.g., time-stretch). Next, we empirically demonstrate how current state-of-the-art representation models lack robustness to such variations. To overcome this, we propose an effective and efficient method to learn robust discrete speech representation for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding and modeling metrics. We additionally evaluate our method on the speech-to-speech translation task, considering Spanish-English and French-English translations, and show the proposed approach outperforms the evaluated baselines.
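For intuition, the robustness measure described in the abstract can be read as an edit distance between the unit sequences of a signal and its transformed version. The sketch below is one plausible rendering under that reading, assuming a hypothetical encode_units function (e.g., a self-supervised encoder followed by k-means quantization) and a content-preserving transform such as time-stretch; it is not the paper's exact formulation.

```python
# A minimal sketch of the robustness measure, not the paper's exact definition:
# `encode_units` and `transform` are hypothetical placeholders supplied by the caller.
from typing import Callable, List, Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                      # deletion
                dp[j - 1] + 1,                  # insertion
                prev + (a[i - 1] != b[j - 1]),  # substitution
            )
            prev = cur
    return dp[-1]


def unit_edit_distance(
    encode_units: Callable[[object], List[int]],
    signal: object,
    transform: Callable[[object], object],
) -> float:
    """Length-normalized edit distance between the de-duplicated unit
    sequences of a signal and its content-preserving transformation.
    Lower values indicate a more robust discrete representation."""

    def dedup(units: List[int]) -> List[int]:
        # Collapse consecutive repeats, as is common for speech LM units.
        return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

    u_orig = dedup(encode_units(signal))
    u_aug = dedup(encode_units(transform(signal)))
    return edit_distance(u_orig, u_aug) / max(len(u_orig), 1)
```

The de-duplication step reflects the fact that generative spoken LMs typically model runs of identical units as a single token, so comparing raw frame-level units would over-penalize transformations that only change speaking rate.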
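The training recipe, a set of signal transformations combined with iterative pseudo-labeling, could look roughly like the following. This is a minimal PyTorch-style sketch in which `teacher`, `student`, `quantizer`, `augmentations`, and `loader` are all hypothetical components; for simplicity it assumes length-preserving transforms (length-changing ones such as time-stretch would additionally require resampling the targets).

```python
# A rough sketch of one round of iterative pseudo-labeling: a teacher encoder
# produces discrete pseudo-labels on clean audio, and a student is optimized to
# predict those labels from a randomly transformed copy of the same audio.
import random

import torch
import torch.nn.functional as F


def pseudo_label_round(teacher, student, quantizer, augmentations, loader, optimizer):
    """One round of iterative pseudo-labeling over a dataset of clean waveforms."""
    teacher.eval()
    student.train()
    for clean_wav in loader:
        with torch.no_grad():
            # Pseudo-labels: discrete units of the untransformed signal.
            targets = quantizer(teacher(clean_wav))  # (batch, time)
        # The student only ever sees an augmented view of the signal.
        augmented = random.choice(augmentations)(clean_wav)
        logits = student(augmented)                  # (batch, time, n_units)
        loss = F.cross_entropy(logits.transpose(1, 2), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # In the iterative scheme, the trained student can seed the next round's teacher.
    return student
```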