A Comparative Study of Self-supervised Speech Representation Based Voice Conversion

Paper Title

A Comparative Study of Self-supervised Speech Representation Based Voice Conversion

Authors

Wen-Chin Huang, Shu-Wen Yang, Tomoki Hayashi, Tomoki Toda

Abstract

We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations such as phonetic posteriorgrams (PPGs), which are commonly adopted by state-of-the-art VC systems. Using S3PRL-VC, an open-source VC software we previously developed, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-/cross-lingual any-to-one (A2O) and any-to-any (A2A) VC, using the voice conversion challenge 2020 (VCC2020) dataset. We investigated S3R-based VC in various aspects, including model type, multilinguality, and supervision. We also studied the effect of a post-discretization process with k-means clustering and showed how it improves in the A2A setting. Finally, the comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and also sheds light on the possible improving directions.
