Paper Title

A Spectral Energy Distance for Parallel Speech Synthesis

Paper Authors

Alexey A. Gritsenko, Tim Salimans, Rianne van den Berg, Jasper Snoek, Nal Kalchbrenner

Paper Abstract

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
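To make the abstract's central objective concrete, below is a minimal NumPy/SciPy sketch of a generalized energy distance computed over multi-scale magnitude spectrograms. The window sizes, the small floor inside the logarithm, and the particular mix of L1 and log-L2 terms are illustrative assumptions rather than the paper's exact configuration, and an actual training setup would implement the same computation in a differentiable framework so the loss can be backpropagated through the parallel generator.

```python
import numpy as np
from scipy.signal import stft


def multiscale_spectrogram_distance(x, y, window_sizes=(64, 128, 256, 512, 1024, 2048)):
    """Distance between two waveforms computed on magnitude spectrograms
    at several STFT resolutions. The window sizes and the L1 + log-L2
    combination are illustrative choices, not the paper's exact setup."""
    total = 0.0
    for n in window_sizes:
        _, _, X = stft(x, nperseg=n)
        _, _, Y = stft(y, nperseg=n)
        mag_x, mag_y = np.abs(X), np.abs(Y)
        # L1 on raw magnitudes plus L2 on log-magnitudes (1e-5 avoids log(0))
        total += np.sum(np.abs(mag_x - mag_y))
        total += np.sqrt(np.sum((np.log(mag_x + 1e-5) - np.log(mag_y + 1e-5)) ** 2))
    return total


def spectral_energy_distance(real, gen_a, gen_b):
    """Minibatch estimate of the generalized energy distance
        2 * E[d(real, gen)] - E[d(gen, gen')],
    where gen_a and gen_b are two independent model samples drawn for the
    same conditioning input. The E[d(real, real')] term is constant with
    respect to the model parameters and is therefore omitted."""
    attract = (multiscale_spectrogram_distance(real, gen_a)
               + multiscale_spectrogram_distance(real, gen_b))
    repel = multiscale_spectrogram_distance(gen_a, gen_b)
    return attract - repel
```

Minimizing this quantity pulls generated spectrograms toward the real ones, while the repulsive term between the two independent generated samples prevents the model from collapsing to a single deterministic output; this repulsive term is also what makes the minibatch estimate unbiased and the underlying scoring rule proper.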
