Paper Title
Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator
Paper Authors
Paper Abstract
We introduce a novel method for emotion conversion in speech that does not require parallel training data. Our approach loosely relies on a cycle-GAN schema to minimize the reconstruction error from converting back and forth between emotion pairs. However, unlike the conventional cycle-GAN, our discriminator classifies whether a pair of input real and generated samples corresponds to the desired emotion conversion (e.g., A to B) or to its inverse (B to A). We show that this setup, which we refer to as a variational cycle-GAN (VC-GAN), is equivalent to minimizing the empirical KL divergence between the source features and their cyclic counterparts. In addition, our generator combines a trainable deep network with a fixed generative block to implement a smooth and invertible transformation on the input features, in our case, the fundamental frequency (F0) contour. This hybrid architecture regularizes our adversarial training procedure. We use crowdsourcing to evaluate both the emotional saliency and the quality of the synthesized speech. Finally, we show that our model generalizes to new speakers by modifying speech produced by WaveNet.
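The two ideas in the abstract can be illustrated with a toy sketch: a cycle-reconstruction loss between an A-to-B generator and its inverse, and a pair discriminator that scores a (source, converted) pair rather than a single sample. Everything below is a hypothetical stand-in (linear maps and a random contour), not the paper's architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_ab(f0):
    # Toy A->B conversion of an F0 contour (stand-in for the hybrid generator)
    return 1.2 * f0 + 10.0

def generator_ba(f0):
    # Toy B->A conversion, the approximate inverse of generator_ab
    return (f0 - 10.0) / 1.2

def cycle_loss(f0_a):
    """Reconstruction error from converting A -> B -> A."""
    return float(np.mean((generator_ba(generator_ab(f0_a)) - f0_a) ** 2))

def pair_discriminator(pair, w):
    """Scores a concatenated (source, converted) pair.

    A positive score means the pair is classified as realizing the
    desired A->B conversion; negative means its inverse (B->A).
    """
    return float(np.dot(w, np.concatenate(pair)))

f0_a = rng.normal(150.0, 20.0, size=8)   # hypothetical F0 contour in Hz
w = rng.normal(size=16)                  # stand-in discriminator weights

forward_pair = (f0_a, generator_ab(f0_a))            # candidate A->B pair
backward_pair = (generator_ab(f0_a), f0_a)           # candidate B->A pair

print(cycle_loss(f0_a))                  # near zero: generators invert each other
print(pair_discriminator(forward_pair, w))
print(pair_discriminator(backward_pair, w))
```

In the paper's setting, the generator and discriminator would be trained jointly so that cycle-consistent conversions also fool the pair discriminator; here the near-zero cycle loss simply demonstrates the invertibility property the abstract attributes to the hybrid generator.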