Paper Title

Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation

Paper Authors

Lingjie Liu, Weipeng Xu, Marc Habermann, Michael Zollhoefer, Florian Bernard, Hyeongwoo Kim, Wenping Wang, Christian Theobalt

Paper Abstract

Synthesizing realistic videos of humans using neural networks has been a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail, such as pose-dependent wrinkles in the clothing. In this paper, we propose a novel human video synthesis method that approaches these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
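As a rough illustration of the two-stage design described in the abstract, below is a minimal PyTorch sketch of the pipeline structure: a first CNN maps pose input to a dynamic texture map, and a second CNN translates the (rendered) textured result into the final video frame. The module names, layer choices, channel counts, and the direct feed from the first network into the second are illustrative assumptions, not the authors' released implementation; in the actual method the predicted texture would be applied to the body model and rendered before the translation network produces the frame.

```python
# Minimal sketch of the two-CNN pipeline described in the abstract.
# All module names, architectures, and shapes are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # Simple convolutional block shared by both networks (assumed architecture).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class TexNet(nn.Module):
    """First CNN: predicts a dynamic texture map from pose input."""

    def __init__(self, pose_channels=3, tex_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(pose_channels, 64),
            conv_block(64, 64),
            nn.Conv2d(64, tex_channels, kernel_size=1),
            nn.Tanh(),  # texture values in [-1, 1]
        )

    def forward(self, pose_map):
        return self.net(pose_map)


class RefNet(nn.Module):
    """Second CNN: rendering-to-video translation of the textured result."""

    def __init__(self, in_channels=3, out_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(in_channels, 64),
            conv_block(64, 64),
            nn.Conv2d(64, out_channels, kernel_size=1),
            nn.Tanh(),
        )

    def forward(self, rendered):
        return self.net(rendered)


if __name__ == "__main__":
    # Pose encoded as an image-space map (e.g. a part or normal map) -- an assumption.
    pose_map = torch.randn(1, 3, 256, 256)
    texture = TexNet()(pose_map)   # stage 1: time-coherent texture prediction
    # In the real pipeline the texture is applied to the body model and rendered
    # before stage 2; here it is fed through directly to keep the sketch self-contained.
    frame = RefNet()(texture)      # stage 2: rendering-to-video translation
    print(frame.shape)             # torch.Size([1, 3, 256, 256])
```

The point of the split is that the first stage operates in a pose-independent texture space, so high-frequency detail stays temporally coherent; the second stage only needs to handle the screen-space embedding.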
