Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Paper title

Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Authors

Qiang, Chunyu, Yang, Peng, Che, Hao, Wang, Xiaorui, Wang, Zhongyuan

Abstract

Cross-speaker style transfer in speech synthesis aims at transferring a style from source speaker to synthesised speech of a target speaker's timbre. Most previous approaches rely on data with style labels, but manually-annotated labels are expensive and not always reliable. In response to this problem, we propose Style-Label-Free, a cross-speaker style transfer method, which can realize the style transfer from source speaker to target speaker without style labels. Firstly, a reference encoder structure based on quantized variational autoencoder (Q-VAE) and style bottleneck is designed to extract discrete style representations. Secondly, a speaker-wise batch normalization layer is proposed to reduce the source speaker leakage. In order to improve the style extraction ability of the reference encoder, a style invariant and contrastive data augmentation method is proposed. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
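
The two mechanisms named in the abstract, discrete style representations and speaker-wise normalization, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `quantize` and `speaker_wise_normalize`, the L2 nearest-neighbour lookup, and the plain per-speaker standardization are assumptions chosen for illustration, and the trained network, learnable parameters, and loss terms are omitted entirely.

```python
import numpy as np


def quantize(style, codebook):
    """Map each style vector to its nearest codebook entry (L2 distance).

    Hypothetical helper: style has shape (batch, dim),
    codebook has shape (num_codes, dim).
    """
    # Pairwise squared distances, shape (batch, num_codes).
    dists = ((style[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return codebook[indices], indices


def speaker_wise_normalize(style, speaker_ids, eps=1e-5):
    """Standardize style vectors separately within each speaker's group.

    Hypothetical helper: statistics are computed per speaker rather than
    over the whole batch, so speaker-specific offsets and scales are removed.
    """
    out = np.empty_like(style, dtype=float)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        group = style[mask]
        out[mask] = (group - group.mean(axis=0)) / np.sqrt(group.var(axis=0) + eps)
    return out
```

In this toy form, two speakers whose style vectors differ only by a speaker-specific offset and scale end up with matching normalized vectors, which is the intuition behind reducing source speaker leakage.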
