Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

Paper Title

Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

Authors

Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie

Abstract

In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which is proven to be beneficial to speech generation. Second, we substitute the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation, which will be open-sourced shortly, can achieve a real-time factor of 0.03 on CPU without hardware-specific optimization.
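
To put the efficiency figures in the abstract into concrete terms, the short sketch below works through the arithmetic. It assumes the usual definition of real-time factor (synthesis time divided by the duration of the generated audio); the helper names and the 10-second utterance are hypothetical and chosen only for illustration, and nothing here comes from the authors' implementation.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of audio at a given RTF."""
    return rtf * audio_seconds


# Figures quoted in the abstract.
RTF_CPU = 0.03            # real-time factor on CPU
GFLOPS_MELGAN = 5.85      # original MelGAN
GFLOPS_MULTIBAND = 0.95   # multi-band MelGAN

# Hypothetical utterance length, used only as an example.
utterance_seconds = 10.0

seconds_needed = synthesis_time(RTF_CPU, utterance_seconds)
speed_vs_real_time = 1.0 / RTF_CPU
complexity_reduction = GFLOPS_MELGAN / GFLOPS_MULTIBAND

print(f"{utterance_seconds:.0f} s of audio takes about {seconds_needed:.2f} s to generate")
print(f"about {speed_vs_real_time:.1f}x faster than real time")
print(f"about {complexity_reduction:.2f}x lower computational cost than the original MelGAN")
```

Under these assumptions, an RTF below 1 means synthesis runs faster than playback, so the reported 0.03 corresponds to generating audio roughly 33 times faster than real time on CPU, and the drop from 5.85 to 0.95 GFLOPS is a reduction of roughly a factor of six.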
