Paper Title
Conversational End-to-End TTS for Voice Agent
Authors
Abstract
End-to-end neural TTS has achieved superior performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains a challenge because of limitations in both the corpus and modeling capability. This study aims to build a conversational TTS for a voice agent under the sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and a conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that uses an auxiliary encoder and a conversational context encoder to reinforce information about the current utterance and its context in the conversation. Experimental results show that the proposed methods produce more natural prosody in accordance with the conversational context, with significant preference gains at both the utterance level and the conversation level. Moreover, we find that the model can express some spontaneous behaviors, such as fillers and repeated words, which makes the conversational speaking style more realistic.