Prosody-controllable spontaneous TTS with neural HMMs

Paper title

Prosody-controllable spontaneous TTS with neural HMMs

Authors

Harm Lameris, Shivam Mehta, Gustav Eje Henter, Joakim Gustafson, Éva Székely

Abstract

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system that is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky voice. Audio samples are available at https://www.speech.kth.se/tts-demos/prosodic-hmm/
