Paper Title


NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

Paper Authors

Dongchao Yang, Songxiang Liu, Jianwei Yu, Helin Wang, Chao Weng, Yuexian Zou

Abstract

Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre from a reference audio, which faces the following challenges: (1) the highly dynamic prosody information in the reference audio is difficult to extract, especially when the reference audio contains background noise; (2) the TTS system should generalize well to unseen speaking styles. In this paper, we present a \textbf{no}ise-\textbf{r}obust \textbf{e}xpressive TTS model (NoreSpeech), which can robustly transfer the speaking style of a noisy reference utterance to synthesized speech. Specifically, NoreSpeech includes several components: (1) a novel DiffStyle module, which leverages powerful probabilistic denoising diffusion models to learn noise-agnostic speaking style features from a teacher model via knowledge distillation; (2) a VQ-VAE block, which maps the style features into a controllable quantized latent space to improve the generalization of style transfer; and (3) a straightforward but effective parameter-free text-style alignment module, which enables NoreSpeech to transfer style to a textual input from a length-mismatched reference utterance. Experiments demonstrate that NoreSpeech is more effective than previous expressive TTS models in noisy environments. Audio samples and code are available at: \href{http://dongchaoyang.top/NoreSpeech\_demo/}{http://dongchaoyang.top/NoreSpeech\_demo/}
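The VQ-VAE block described above maps continuous style features into a quantized latent space by snapping each feature vector to its nearest entry in a learned codebook. The following is a minimal NumPy sketch of that nearest-neighbor quantization step only; the function name, shapes, and the example codebook are illustrative assumptions, not the paper's actual implementation (which would also include the encoder, decoder, and commitment losses used to train the codebook):

```python
import numpy as np

def vector_quantize(style_features, codebook):
    """Nearest-neighbor quantization step of a VQ-VAE (illustrative sketch).

    style_features: (T, D) array of frame-level style features.
    codebook:       (K, D) array of K learned code vectors.
    Returns (quantized, indices), where each of the T feature frames is
    replaced by its closest code vector in squared Euclidean distance.
    """
    # Pairwise squared distances between every frame and every code vector,
    # computed via broadcasting: result has shape (T, K).
    dists = ((style_features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # index of nearest code per frame
    quantized = codebook[indices]    # (T, D): discrete latent representation
    return quantized, indices

# Hypothetical usage with a toy 2-entry codebook in 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
feats = np.array([[0.1, -0.1], [0.9, 1.2]])
quantized, indices = vector_quantize(feats, codebook)
```

Because the downstream TTS model then conditions on the discrete `indices` (or their code vectors) rather than the raw features, small noise-induced perturbations that do not cross a codebook boundary are quantized away, which is one intuition for why such a bottleneck can aid robustness and generalization.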
