Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Paper title

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Authors

Shang, Jinghuan; Das, Srijan; Ryoo, Michael S.

Abstract

Humans are remarkably flexible in understanding viewpoint changes because the visual cortex supports the perception of 3D structure. In contrast, most computer vision models that learn visual representations from a pool of 2D images often fail to generalize to novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens, trained in an unsupervised fashion. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks, including image classification, multi-view video alignment, and action recognition. Models with 3DTRL outperform their backbone Transformers in all tasks with minimal added computation. Our code is available at https://github.com/elicassion/3DTRL.
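
The geometric idea in the abstract (a per-token pseudo-depth plus a camera transform that places tokens in 3D) can be illustrated with a minimal sketch. This is not the authors' implementation: in 3DTRL the depth and camera parameters are estimated by learned components, whereas here they are passed in as plain numbers. The function names `backproject_tokens` and `camera_to_world`, the pinhole model with a `focal` parameter, and the normalized patch-center coordinates are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_tokens(depths: Sequence[float], grid_h: int, grid_w: int,
                       focal: float = 1.0) -> List[Point]:
    """Lift each patch token to a camera-space 3D point.

    Assumes a simple pinhole camera: patch centers are mapped to
    normalized image coordinates in [-1, 1] and scaled by depth / focal.
    """
    if len(depths) != grid_h * grid_w:
        raise ValueError("expected one depth value per token")
    points: List[Point] = []
    for idx, d in enumerate(depths):
        row, col = divmod(idx, grid_w)
        u = (col + 0.5) / grid_w * 2.0 - 1.0  # horizontal patch center
        v = (row + 0.5) / grid_h * 2.0 - 1.0  # vertical patch center
        points.append((u * d / focal, v * d / focal, d))
    return points


def camera_to_world(points: Sequence[Point],
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> List[Point]:
    """Apply a rigid transform p_world = R @ p_cam + t to every point."""
    world: List[Point] = []
    for p in points:
        world.append(tuple(
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        ))
    return world


# Hypothetical 2x2 token grid with a constant pseudo-depth of 2.0.
cam_points = backproject_tokens([2.0, 2.0, 2.0, 2.0], grid_h=2, grid_w=2)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world_points = camera_to_world(cam_points, identity, [0.0, 0.0, 1.0])
```

In a full model, the resulting 3D coordinates would be turned into a positional signal for the tokens. That step is omitted here because the abstract does not specify it.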
