Paper Title
Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis

Authors

Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, Baoyuan Wang

Abstract
We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze and blink, head pose, and emotional expression. We represent the different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them. To effectively disentangle each motion factor, we propose a progressive disentangled representation learning strategy that separates the factors in a coarse-to-fine manner: we first extract a unified motion feature from the driving signal, and then isolate each fine-grained motion from the unified feature. We introduce motion-specific contrastive learning and regression for the non-emotional motions, and feature-level decorrelation and self-reconstruction for the emotional expression, to fully exploit the inherent properties of each motion factor in unstructured video data and achieve disentanglement. Experiments show that our method provides high-quality speech and lip-motion synchronization along with precise, disentangled control over multiple extra facial motions, which can hardly be achieved by previous methods.
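To make the two loss families mentioned in the abstract concrete, below is a minimal, illustrative NumPy sketch of (a) an InfoNCE-style contrastive loss, the generic form behind "motion-specific contrastive learning", and (b) a cross-correlation penalty as one common realization of "feature-level decorrelation". This is not the authors' implementation; all function names, shapes, and the temperature value are illustrative assumptions.

```python
import numpy as np


def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """Illustrative InfoNCE contrastive loss for a single anchor.

    anchor, positive: (d,) feature vectors; negatives: (n, d).
    Pulls the anchor toward its positive and away from negatives,
    which is the generic idea behind motion-specific contrastive learning.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = np.array([cos(anchor, positive)] +
                    [cos(anchor, n) for n in negatives]) / temperature
    sims -= sims.max()  # numerical stability before exponentiation
    probs = np.exp(sims) / np.exp(sims).sum()
    return -np.log(probs[0])  # cross-entropy with the positive at index 0


def decorrelation_loss(feat_a, feat_b):
    """Illustrative feature-level decorrelation penalty.

    feat_a, feat_b: (batch, d) feature batches for two motion factors.
    Standardizes each dimension, forms the (d, d) cross-correlation
    matrix, and penalizes its squared entries so the two factors carry
    statistically uncorrelated information.
    """
    a = (feat_a - feat_a.mean(0)) / (feat_a.std(0) + 1e-8)
    b = (feat_b - feat_b.mean(0)) / (feat_b.std(0) + 1e-8)
    corr = a.T @ b / len(a)  # cross-correlation matrix
    return float((corr ** 2).mean())
```

In a training loop these terms would be added to the reconstruction objective; the decorrelation term is driven toward zero so that, e.g., the emotion feature cannot predict the lip-motion feature.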