Latent Vector Recovery of Audio GANs

Paper Title

Latent Vector Recovery of Audio GANs

Authors

Andrew Keyes, Nicky Bayat, Vahid Reza Khazaie, Yalda Mohsenzadeh

Abstract

Advanced Generative Adversarial Networks (GANs) are remarkable at generating intelligible audio from a random latent vector. In this paper, we examine the task of recovering the latent vector of both synthesized and real audio. Previous works recovered the latent vectors of given audio through an auto-encoder-inspired technique that trains an encoder network either in parallel with the GAN or after the generator is trained. In our approach, we train a deep residual neural network architecture to project audio synthesized by WaveGAN into the corresponding latent space with near-identical reconstruction performance. To accommodate the lack of an original latent vector for real audio, we optimize the residual network on the perceptual loss between the real audio samples and the audio reconstructed from the predicted latent vectors. In the case of synthesized audio, the Mean Squared Error (MSE) between the ground-truth and recovered latent vectors is minimized as well. We further investigate the audio reconstruction performance when several gradient optimization steps are applied to the predicted latent vector. Through our deep-neural-network-based method of training on real and synthesized audio, we are able to predict a latent vector that corresponds to a reasonable reconstruction of real audio. Even though we evaluated our method on WaveGAN, our proposed method is universal and can be applied to any other GAN.
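
The refinement step mentioned in the abstract, applying several gradient optimization steps to a predicted latent vector, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the "generator" is a fixed random linear map rather than WaveGAN, the encoder's prediction is simulated by adding noise to the true latent vector, and a plain mean squared error on the signal stands in for the perceptual loss. The names `generate`, `reconstruction_loss`, and `refine_latent` are hypothetical.

```python
import random


def generate(weights, z):
    """Toy linear 'generator' (assumption): maps latent z to a signal x = W z."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def reconstruction_loss(weights, z, target):
    """Mean squared error between generate(z) and the target signal."""
    output = generate(weights, z)
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


def refine_latent(weights, z_init, target, steps=200, lr=0.1):
    """Apply gradient-descent steps to z, starting from an initial prediction."""
    z = list(z_init)
    n = len(target)
    for _ in range(steps):
        residual = [o - t for o, t in zip(generate(weights, z), target)]
        # Gradient of the MSE with respect to z is (2 / n) * W^T (W z - x).
        grad = [
            (2.0 / n) * sum(weights[i][j] * residual[i] for i in range(n))
            for j in range(len(z))
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z


rng = random.Random(0)
latent_dim, signal_dim = 4, 16
weights = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(signal_dim)]
z_true = [rng.gauss(0, 1) for _ in range(latent_dim)]
target = generate(weights, z_true)

# Stand-in for the encoder's output (assumption): the true latent plus noise.
z_pred = [z + rng.gauss(0, 0.5) for z in z_true]
z_refined = refine_latent(weights, z_pred, target)

loss_before = reconstruction_loss(weights, z_pred, target)
loss_after = reconstruction_loss(weights, z_refined, target)
print(f"loss before refinement: {loss_before:.6f}")
print(f"loss after refinement:  {loss_after:.6f}")
```

In a real setting the generator is nonlinear, so the gradient would come from automatic differentiation through the trained network rather than from a closed-form expression as in this sketch.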
