Paper Title
Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers
Authors
Abstract
Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction, but they require a substantial number of parameters and expensive computations. Because of their large memory overhead and slow inference speed, such models are difficult to deploy in practice. In this paper, we propose FastMETRO, a novel transformer encoder-decoder architecture for 3D human mesh reconstruction from a single image. We identify that the performance bottleneck in encoder-based transformers is caused by the token design, which introduces high-complexity interactions among input tokens. We disentangle these interactions via an encoder-decoder architecture, which allows our model to require far fewer parameters and a shorter inference time. In addition, we impose prior knowledge of the human body's morphological relationships via attention masking and mesh upsampling operations, which leads to faster convergence with higher accuracy. FastMETRO improves the Pareto front of accuracy and efficiency, and clearly outperforms image-based methods on Human3.6M and 3DPW. Furthermore, we validate its generalizability on FreiHAND.
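The attention masking mentioned in the abstract can be illustrated with a minimal sketch. It assumes the mask comes from mesh adjacency, so that each vertex token attends only to itself and its direct neighbors; the abstract does not spell out this construction. The function names `adjacency_mask` and `masked_softmax` and the four-vertex toy mesh are hypothetical and are not taken from the FastMETRO code.

```python
import math


def adjacency_mask(num_vertices, edges):
    """Build a boolean matrix where True means attention is BLOCKED.

    Assumption: a vertex may attend to itself and to vertices it shares
    an edge with; every other pair is blocked.
    """
    blocked = [[True] * num_vertices for _ in range(num_vertices)]
    for i in range(num_vertices):
        blocked[i][i] = False  # self-attention is always allowed
    for a, b in edges:
        blocked[a][b] = False  # edges are undirected
        blocked[b][a] = False
    return blocked


def masked_softmax(scores, blocked_row):
    """Softmax over one row of attention scores, ignoring blocked entries."""
    allowed = [s for s, blk in zip(scores, blocked_row) if not blk]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [0.0 if blk else math.exp(s - peak)
            for s, blk in zip(scores, blocked_row)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy mesh: four vertices connected in a chain 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
mask = adjacency_mask(4, edges)

# Raw attention scores of vertex 0 against all four vertices (made-up values).
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], mask[0])
```

In this toy case vertex 0 places all of its attention weight on itself and on vertex 1, even though vertices 2 and 3 have higher raw scores. This is one way a mask can encode a body-structure prior.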