Paper Title
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Paper Authors
Paper Abstract
AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.
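The abstract names three mechanisms: diffusion in a low-dimensional 3D latent space, conditional latent perturbation of previously generated latents, and unconditional guidance during long-video extension. The sketch below is a minimal PyTorch illustration of how such pieces could fit together; it is not the authors' released implementation, and every class and function name (Encoder3D, Decoder3D, Denoiser, guided_eps) is a hypothetical placeholder. The hierarchical diffusion component and timestep embeddings are omitted for brevity.

import torch
import torch.nn as nn

class Encoder3D(nn.Module):
    """Stand-in 3D encoder: compresses a video clip (B, C, T, H, W)
    into a low-dimensional 3D latent (B, c, t, h, w)."""
    def __init__(self, in_ch=3, latent_ch=4):
        super().__init__()
        self.net = nn.Conv3d(in_ch, latent_ch, kernel_size=4, stride=4)

    def forward(self, video):
        return self.net(video)

class Decoder3D(nn.Module):
    """Stand-in 3D decoder: maps latents back to pixel space."""
    def __init__(self, latent_ch=4, out_ch=3):
        super().__init__()
        self.net = nn.ConvTranspose3d(latent_ch, out_ch, kernel_size=4, stride=4)

    def forward(self, z):
        return self.net(z)

class Denoiser(nn.Module):
    """Stand-in noise-prediction network on 3D latents, conditioned on
    previously generated latent frames (timestep embedding omitted)."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.net = nn.Conv3d(2 * latent_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, z_t, t, cond):
        # cond is zeroed out for the unconditional branch.
        return self.net(torch.cat([z_t, cond], dim=1))

def guided_eps(model, z_t, t, cond, guidance_scale=2.0, perturb_std=0.1):
    """One guided denoising call used when extending a video in latent space.

    Conditional latent perturbation: lightly noise the conditioning latents
    so errors accumulated in earlier generated frames do not compound.
    Unconditional guidance: blend conditional and unconditional predictions,
    in the spirit of classifier-free guidance."""
    cond_perturbed = cond + perturb_std * torch.randn_like(cond)
    eps_cond = model(z_t, t, cond_perturbed)
    eps_uncond = model(z_t, t, torch.zeros_like(cond))
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

if __name__ == "__main__":
    enc, dec, denoiser = Encoder3D(), Decoder3D(), Denoiser()
    video = torch.randn(1, 3, 16, 64, 64)      # short clip: B, C, T, H, W
    z = enc(video)                              # low-dimensional 3D latent
    z_t = torch.randn_like(z)                   # noisy latent at some step t
    eps = guided_eps(denoiser, z_t, torch.tensor([500]), cond=z)
    print(z.shape, eps.shape, dec(z).shape)

The point of the sketch is that both mitigation techniques act only on the latent conditioning path at sampling time, so under these assumptions they add no training cost to the autoencoder or the denoiser itself.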