Paper Title

Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise

Paper Authors

Tuomo Raitio, Petko Petkov, Jiangchuan Li, Muhammed Shifas, Andrea Davis, Yannis Stylianou

Paper Abstract

We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method consists of first measuring the spectral tilt of unlabeled conventional speech data, and then conditioning a neural TTS model with normalized spectral tilt among other prosodic factors. Changing the spectral tilt parameter and keeping other prosodic factors unchanged enables effective vocal effort control at synthesis time independent of other prosodic factors. By extrapolation of the spectral tilt values beyond what has been seen in the original data, we can generate speech with high vocal effort levels, thus improving the intelligibility of speech in the presence of masking noise. We evaluate the intelligibility and quality of normal speech and speech with increased vocal effort in the presence of various masking noise conditions, and compare these to well-known speech intelligibility-enhancing algorithms. The evaluations show that the proposed method can improve the intelligibility of synthetic speech with little loss in speech quality.
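The pipeline described above (measure spectral tilt from unlabeled audio, normalize it as a conditioning feature, then extrapolate the value at synthesis time) can be sketched as follows. This is a minimal illustration under assumptions of our own, not the authors' implementation: tilt is taken here as the slope of a straight-line fit to the frame-wise dB spectrum, z-scored over the corpus, and all function names, thresholds, and the ±3σ extrapolation are hypothetical.

```python
# Sketch: utterance-level spectral tilt as a vocal-effort conditioning value.
import numpy as np


def frame_signal(x, frame_len=1024, hop=256):
    """Split a 1-D signal into overlapping frames (signal assumed >= frame_len)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def spectral_tilt(x, sr, frame_len=1024, hop=256):
    """Estimate an utterance-level spectral tilt in dB per kHz."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    log_mag = 20.0 * np.log10(np.abs(np.fft.rfft(frames, axis=1)) + 1e-10)
    freqs_khz = np.fft.rfftfreq(frame_len, 1.0 / sr) / 1000.0
    # Slope of a straight-line fit to the dB spectrum of each frame.
    slopes = np.polyfit(freqs_khz, log_mag.T, deg=1)[0]
    # Average over high-energy frames so silence does not bias the estimate.
    energy = frames.std(axis=1)
    return float(slopes[energy > 0.1 * energy.max()].mean())


if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    # Toy "utterances": white noise standing in for real recordings.
    tilts = np.array([spectral_tilt(rng.standard_normal(sr), sr) for _ in range(5)])
    # Corpus-level z-score normalization: the TTS model would be conditioned on
    # this zero-mean, unit-variance value alongside other prosodic features.
    norm_tilts = (tilts - tilts.mean()) / (tilts.std() + 1e-8)
    print("normalized tilt per utterance:", np.round(norm_tilts, 2))
    # At synthesis time, extrapolating the conditioning value beyond the range
    # seen in training (e.g. toward a flatter spectrum) aims at higher perceived
    # vocal effort and better intelligibility in noise.
    high_effort_condition = 3.0
    print("extrapolated high-effort conditioning value:", high_effort_condition)
```

Because the tilt value is normalized independently of the other prosodic conditioning features, pushing it beyond the observed range changes vocal effort while the rest of the prosody is left to the model, which is the control property the abstract relies on.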
