Paper Title

AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling

Paper Authors

Bac Nguyen, Fabien Cardinaux, Stefan Uhlich

Paper Abstract

Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder as they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce AutoTTS, a direct text-to-waveform speech synthesis model. AutoTTS enables high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online.
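
The abstract describes a soft-duration mechanism that is optimized in expectation so that alignment learning stays differentiable. As a rough, illustrative sketch only (not the authors' actual formulation), the snippet below shows one common way to make duration-based alignment differentiable: predicted per-token durations are turned into a soft monotonic alignment matrix via cumulative sums and sigmoid windows, so gradients can flow back to the duration predictor. The function name `soft_alignment`, the temperature `tau`, and the sigmoid-window construction are assumptions made for illustration.

```python
# Illustrative sketch of a differentiable "soft duration" alignment.
# NOT the paper's exact method; a generic construction for intuition.
import torch

def soft_alignment(durations: torch.Tensor, num_frames: int, tau: float = 1.0) -> torch.Tensor:
    """durations: (batch, tokens), positive durations measured in output frames.
    Returns soft alignment weights of shape (batch, num_frames, tokens)."""
    ends = torch.cumsum(durations, dim=-1)      # end position of each token
    starts = ends - durations                   # start position of each token
    frames = torch.arange(num_frames, dtype=torch.float32,
                          device=durations.device).view(1, -1, 1) + 0.5
    # Soft indicator that frame t lies inside token i's [start, end) interval;
    # differentiable w.r.t. the predicted durations.
    return (torch.sigmoid((frames - starts.unsqueeze(1)) / tau)
            - torch.sigmoid((frames - ends.unsqueeze(1)) / tau))

# Usage: expand token-level encodings to frame-level features.
B, T_tok, D = 2, 5, 8
dur = torch.rand(B, T_tok, requires_grad=True) * 3 + 1   # hypothetical predicted durations
enc = torch.randn(B, T_tok, D)                           # hypothetical token encodings
align = soft_alignment(dur, num_frames=int(dur.sum(-1).max().item()))
frame_feats = align @ enc            # (B, num_frames, D), differentiable w.r.t. dur
frame_feats.sum().backward()         # gradients reach the duration predictor
```

In such a construction, the alignment is monotonic by design (it follows the cumulative duration boundaries), and the total predicted duration can be regularized toward the ground-truth utterance length, which is the kind of constraint the abstract refers to as matching the total ground-truth duration.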
