Paper title
Benchmarking Generative Latent Variable Models for Speech
Paper authors
Paper abstract
Stochastic latent variable models (LVMs) achieve state-of-the-art performance on natural image generation but are still inferior to deterministic models on speech. In this paper, we develop a speech benchmark of popular temporal LVMs and compare them against state-of-the-art deterministic models. We report the likelihood, which is a much used metric in the image domain, but rarely, or incomparably, reported for speech models. To assess the quality of the learned representations, we also compare their usefulness for phoneme recognition. Finally, we adapt the Clockwork VAE, a state-of-the-art temporal LVM for video generation, to the speech domain. Despite being autoregressive only in latent space, we find that the Clockwork VAE can outperform previous LVMs and reduce the gap to deterministic models by using a hierarchy of latent variables.