Feature Pyramid Transformer

Paper Title

Feature Pyramid Transformer

Authors

Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun

Abstract

Feature interactions across space and scales underpin modern visual recognition systems because they introduce beneficial visual contexts. Conventionally, spatial contexts are passively hidden in the CNN's increasing receptive fields or actively encoded by non-local convolution. Yet, the non-local spatial interactions are not across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing in different scales. To this end, we propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT). It transforms any feature pyramid into another feature pyramid of the same size but with richer contexts, by using three specially designed transformers in self-level, top-down, and bottom-up interaction fashion. FPT serves as a generic visual backbone with fair computational overhead. We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks, using various backbones and head networks, and observe consistent improvement over all the baselines and the state-of-the-art methods.
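
The idea of interacting across both space and scales while keeping the pyramid's size can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the three transformers, so plain scaled dot-product attention stands in for them, and the names `attend` and `interact_pyramid`, the averaging of contexts, and the residual addition are all assumptions made for illustration. Each level is a list of per-position feature vectors, and all levels are assumed to share one channel dimension.

```python
import math


def attend(queries, keys, values):
    """Plain scaled dot-product attention (a stand-in, not the paper's transformers)."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def interact_pyramid(pyramid):
    """Return a pyramid of the same size whose features mix in cross-scale context.

    Hypothetical helper: every level queries itself (self-level interaction)
    and every other level (cross-scale interaction); the contexts are averaged
    and added back to the input features.
    """
    result = []
    for level in pyramid:
        contexts = [attend(level, other, other) for other in pyramid]
        new_level = []
        for pos, feature in enumerate(level):
            mean_context = [
                sum(ctx[pos][c] for ctx in contexts) / len(contexts)
                for c in range(len(feature))
            ]
            new_level.append([x + m for x, m in zip(feature, mean_context)])
        result.append(new_level)
    return result


# Toy pyramid: 4, 2, and 1 positions, each with 2 channels.
toy_pyramid = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
    [[1.0, 2.0], [2.0, 1.0]],
    [[3.0, 3.0]],
]
enriched = interact_pyramid(toy_pyramid)
```

In this sketch `enriched` has the same number of levels, positions per level, and channels as `toy_pyramid`, mirroring the abstract's statement that the output pyramid has the same size as the input.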
