Paper Title

Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis

Authors

Jennifer Williams, Joanna Rownicka, Pilar Oplustil, Simon King

Abstract

We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech (TTS) synthesis. We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings. First, we train and evaluate our NN model on 13 different TTS and voice conversion (VC) systems from the ASVspoof 2019 Logical Access (LA) dataset. Since it is not known how best to represent speech for this task, we compare 8 different representations alongside MOSNet frame-based features. Our representations include image-based spectrogram features and x-vector embeddings that explicitly model different types of noise such as T60 reverberation time. Our NN predicts MOS with a high correlation to human judgments. We report prediction correlation and error. A key finding is that the quality achieved for certain speakers seems consistent, regardless of the TTS or VC system. It is widely accepted that some speakers give higher quality than others for building a TTS system: our method provides an automatic way to identify such speakers. Finally, to see if our quality prediction models generalize, we predict quality scores for synthetic speech using a separate multi-speaker TTS system that was trained on LibriTTS data, and conduct our own MOS listening test to compare human ratings with our NN predictions.
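
The following is only an illustrative sketch, not the authors' model or training code. It shows, under placeholder assumptions, how a small NN regressor could map fixed-dimensional utterance embeddings (such as x-vectors) to MOS, and how predicted scores could be compared against human ratings using the kinds of correlation and error metrics the abstract mentions. The random data, embedding dimension, and the scikit-learn MLPRegressor are stand-ins; the paper's actual system compares MOSNet-style frame-based features and several other representations.

```python
# Minimal sketch (assumptions, not the authors' implementation):
# regress MOS from utterance-level embeddings and score against human MOS.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Placeholder data: 2000 utterances with hypothetical 512-dim embeddings
# (e.g., x-vectors) and MOS targets on a 1-5 scale from listening tests.
X = rng.normal(size=(2000, 512))
y = np.clip(3.0 + 0.5 * X[:, 0] + rng.normal(scale=0.3, size=2000), 1.0, 5.0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Small feed-forward regressor as a stand-in for the quality-prediction NN.
model = MLPRegressor(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
model.fit(X_train, y_train)

# Report error and correlation with human MOS, as in the paper's evaluation.
pred = model.predict(X_test)
mse = np.mean((pred - y_test) ** 2)
lcc, _ = pearsonr(pred, y_test)    # linear correlation
srcc, _ = spearmanr(pred, y_test)  # rank correlation
print(f"MSE={mse:.3f}  LCC={lcc:.3f}  SRCC={srcc:.3f}")
```

Aggregating such predicted scores per speaker (rather than per system) is the kind of analysis that would reveal whether certain speakers consistently yield higher- or lower-quality synthesis.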
