PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation

Paper title

PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation

Authors

Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, Xiaohui Xie

Abstract

Recently, the vision transformer and its variants have played an increasingly important role in both monocular and multi-view human pose estimation. Considering image patches as tokens, transformers can model the global dependencies within the entire image or across images from other views. However, global attention is computationally expensive. As a consequence, it is difficult to scale up these transformer-based methods to high-resolution features and many views. In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and performs self-attention only within selected tokens. Furthermore, we extend our PPT to multi-view human pose estimation. Built upon PPT, we propose a new cross-view fusion strategy, called human area fusion, which considers all human foreground pixels as corresponding candidates. Experimental results on COCO and MPII demonstrate that our PPT can match the accuracy of previous pose transformer methods while reducing the computation. Moreover, experiments on Human 3.6M and Ski-Pose demonstrate that our Multi-view PPT can efficiently fuse cues from multiple views and achieve new state-of-the-art results.
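
The abstract says PPT runs self-attention only within selected tokens, but it does not say how the tokens are chosen. The sketch below is therefore only an illustration of the general idea, not the authors' method. It assumes a hypothetical per-token score (higher meaning more likely to lie on the person), keeps the top fraction of tokens, and applies single-head attention with identity projections for brevity. The names `pruned_self_attention`, `scores`, and `keep_ratio` are invented for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pruned_self_attention(tokens, scores, keep_ratio=0.5):
    """Self-attention restricted to the highest-scoring tokens.

    tokens: (N, D) patch embeddings.
    scores: (N,) hypothetical per-token scores (an assumption of this sketch).
    Returns the attended tokens (k, D) and the indices that were kept.
    """
    n, d = tokens.shape
    k = max(1, int(round(n * keep_ratio)))
    # Indices of the k highest-scoring tokens, restored to original order.
    kept = np.sort(np.argsort(scores)[-k:])
    selected = tokens[kept]                              # (k, D)
    # Identity query/key/value projections, assumed here to keep the sketch short.
    attn = softmax(selected @ selected.T / np.sqrt(d))   # (k, k) instead of (N, N)
    return attn @ selected, kept

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 patch tokens of dimension 8
scores = rng.uniform(size=16)       # stand-in for a rough human-mask score
out, kept = pruned_self_attention(tokens, scores, keep_ratio=0.25)
```

With 16 tokens and `keep_ratio=0.25`, the attention matrix has 4 x 4 entries rather than 16 x 16. This quadratic saving is what makes pruning attractive for high-resolution features and many views.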
