PromptTTS: Controllable Text-to-Speech with Text Descriptions

Paper Title

PromptTTS: Controllable Text-to-Speech with Text Descriptions

Authors

Zhifang Guo, Yichong Leng, Yihan Wu, Sheng Zhao, Xu Tan

Abstract

Using a text description as a prompt to guide the generation of text or images (e.g., GPT-3 or DALL-E 2) has drawn wide attention recently. Beyond text and image generation, in this work we explore the possibility of using text descriptions to guide speech synthesis. We develop a text-to-speech (TTS) system, dubbed PromptTTS, that takes a prompt with both style and content descriptions as input and synthesizes the corresponding speech. Specifically, PromptTTS consists of a style encoder and a content encoder that extract the corresponding representations from the prompt, and a speech decoder that synthesizes speech from the extracted style and content representations. Previous work in controllable TTS requires users to have acoustic knowledge to understand style factors such as prosody and pitch. PromptTTS is more user-friendly, since a text description is a more natural way to express speech style (e.g., "A lady whispers to her friend slowly"). Because no TTS dataset with prompts exists, we construct and release a dataset containing prompts with style and content information and the corresponding speech, in order to benchmark the PromptTTS task. Experiments show that PromptTTS can generate speech with precise style control and high speech quality. Audio samples and our dataset are publicly available.
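
The data flow described in the abstract (a prompt with style and content parts, two encoders, one decoder) can be sketched as below. This is only an illustrative toy, not the authors' implementation: the names `Prompt`, `style_encoder`, `content_encoder`, `speech_decoder`, and `synthesize` are hypothetical, and each component is a trivial placeholder where the real system would use a learned neural network.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Prompt:
    """Hypothetical container for the two parts of a prompt."""
    style: str    # e.g. "A lady whispers to her friend slowly"
    content: str  # the text to be spoken


def style_encoder(style_description: str, dim: int = 4) -> List[float]:
    """Placeholder: bucket words into a fixed-size count vector.

    A real style encoder would be a learned text model.
    """
    vec = [0.0] * dim
    for word in style_description.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    return vec


def content_encoder(content: str) -> List[int]:
    """Placeholder: map each character to an integer ID."""
    return [ord(c) for c in content]


def speech_decoder(style_vec: List[float], content_ids: List[int]) -> List[List[float]]:
    """Placeholder: emit one 'frame' per content token.

    Each frame is the token ID followed by the style vector, which
    stands in for conditioning the decoder on the style representation.
    """
    return [[float(tok)] + style_vec for tok in content_ids]


def synthesize(prompt: Prompt) -> List[List[float]]:
    """Run the prompt through both encoders, then the decoder."""
    style_vec = style_encoder(prompt.style)
    content_ids = content_encoder(prompt.content)
    return speech_decoder(style_vec, content_ids)


frames = synthesize(Prompt("A lady whispers to her friend slowly", "Hello"))
```

The point of the sketch is the separation of concerns: the style description and the content text are encoded independently, and only the decoder combines them.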

Paper Title