Paper Title

Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

Paper Authors

Jiachen Lian, Chunlei Zhang, Dong Yu

Paper Abstract

Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from the novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve disentanglement by balancing the information flow between a global speaker representation and a time-varying content representation in a sequential variational autoencoder (VAE). Zero-shot voice conversion is then performed by feeding an arbitrary speaker embedding together with the content embeddings to the VAE decoder. In addition, an on-the-fly data augmentation training strategy is applied to make the learned representations noise-invariant. On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the system remains robust even with noisy source/target utterances.
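The abstract describes the architecture only at a high level. Below is a minimal illustrative sketch, not the authors' implementation, of the core idea in PyTorch: a sequential VAE with one global (per-utterance) speaker latent and per-frame content latents, zero-shot conversion by swapping which utterance supplies the speaker latent, and a simple on-the-fly input perturbation. All names (`DisentangledVAE`, `zero_shot_convert`, `augment`), the LSTM encoders, the latent dimensions, and the additive-noise augmentation are assumptions for illustration; the paper's actual modules, losses, and augmentation transforms are not specified in this abstract.

```python
# Hedged sketch of a disentangled sequential VAE for zero-shot VC.
# Module choices and sizes are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

class DisentangledVAE(nn.Module):
    def __init__(self, n_mels=80, spk_dim=64, cnt_dim=32, hidden=256):
        super().__init__()
        self.spk_enc = nn.LSTM(n_mels, hidden, batch_first=True)   # global speaker path
        self.spk_mu = nn.Linear(hidden, spk_dim)
        self.spk_logvar = nn.Linear(hidden, spk_dim)
        self.cnt_enc = nn.LSTM(n_mels, hidden, batch_first=True)   # time-varying content path
        self.cnt_mu = nn.Linear(hidden, cnt_dim)
        self.cnt_logvar = nn.Linear(hidden, cnt_dim)
        self.dec = nn.LSTM(spk_dim + cnt_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    @staticmethod
    def reparam(mu, logvar):
        # Standard VAE reparameterization trick.
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    def encode(self, mel):                        # mel: (B, T, n_mels)
        _, (h, _) = self.spk_enc(mel)             # last hidden state -> one global latent
        z_spk = self.reparam(self.spk_mu(h[-1]), self.spk_logvar(h[-1]))   # (B, spk_dim)
        hs, _ = self.cnt_enc(mel)                 # per-frame states -> time-varying latents
        z_cnt = self.reparam(self.cnt_mu(hs), self.cnt_logvar(hs))         # (B, T, cnt_dim)
        return z_spk, z_cnt

    def decode(self, z_spk, z_cnt):
        # Broadcast the global speaker latent across time and decode jointly.
        z = torch.cat([z_spk.unsqueeze(1).expand(-1, z_cnt.size(1), -1), z_cnt], dim=-1)
        h, _ = self.dec(z)
        return self.out(h)                        # reconstructed mel frames

def augment(mel, noise_std=0.05):
    # On-the-fly augmentation stand-in: additive noise on the input features,
    # so the encoders are pushed toward noise-invariant latents.
    return mel + noise_std * torch.randn_like(mel)

@torch.no_grad()
def zero_shot_convert(model, src_mel, tgt_mel):
    # Content from the source utterance, speaker identity from an unseen target.
    _, z_cnt = model.encode(src_mel)
    z_spk, _ = model.encode(tgt_mel)
    return model.decode(z_spk, z_cnt)
```

During training, a reconstruction loss plus KL terms on both latents would implement the information balancing the abstract refers to; the conversion path itself changes nothing in the model, it only swaps which utterance supplies the global latent.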
