Paper Title

DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

Authors

Saeki, Takaaki, Tachibana, Kentaro, Yamamoto, Ryuichi

Abstract

Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed environment, incurring a high cost for data collection. To solve this problem, existing noise-robust TTS methods are intended to use noisy speech corpora as training data. However, they only address either time-invariant or time-variant noises. We propose a degradation-robust TTS method, which can be trained on speech corpora that contain both additive noises and environmental distortions. It jointly represents the time-variant additive noises with a frame-level encoder and the time-invariant environmental distortions with an utterance-level encoder. We also propose a regularization method to attain clean environmental embedding that is disentangled from the utterance-dependent information such as linguistic contents and speaker characteristics. Evaluation results show that our method achieved significantly higher-quality synthetic speech than previous methods in the condition including both additive noise and reverberation.
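The abstract's central idea is conditioning on two acoustic representations at different temporal granularities: a frame-level embedding that tracks time-variant additive noise, and a single utterance-level embedding for time-invariant environmental distortion such as reverberation. A minimal sketch of that structure follows; all dimensions, projections, and names here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- not taken from the paper.
T, D, E = 50, 80, 16  # frames, mel bins, embedding size

W_frame = rng.standard_normal((D, E)) * 0.01  # frame-level projection (assumed)
W_utt = rng.standard_normal((D, E)) * 0.01    # utterance-level projection (assumed)

def frame_level_encode(mel):
    """One embedding per frame: can track time-variant additive noise."""
    return mel @ W_frame  # shape (T, E)

def utterance_level_encode(mel):
    """One embedding per utterance: a time-invariant summary intended to
    capture environmental distortion such as reverberation."""
    return np.tanh(mel.mean(axis=0) @ W_utt)  # shape (E,)

mel = rng.standard_normal((T, D))          # stand-in for a mel-spectrogram
z_frame = frame_level_encode(mel)          # (T, E), varies over time
z_utt = utterance_level_encode(mel)        # (E,), constant for the utterance

# A TTS decoder would be conditioned on both, broadcasting the
# utterance-level embedding across all frames.
conditioning = z_frame + z_utt[None, :]    # (T, E)
print(conditioning.shape)
```

In this reading, the frame-level path gives the model the temporal resolution to represent noise that changes within an utterance, while the utterance-level path forces a single code for properties that are constant across it; the paper's regularization then pushes that utterance-level code to exclude linguistic and speaker information.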
