Paper Title

SP-ViT: Learning 2D Spatial Priors for Vision Transformers

Authors

Yuxuan Zhou, Wangmeng Xiang, Chao Li, Biao Wang, Xihan Wei, Lei Zhang, Margret Keuper, Xiansheng Hua

Abstract

Recently, transformers have shown great potential in image classification and established state-of-the-art results on the ImageNet benchmark. However, compared to CNNs, transformers converge slowly and are prone to overfitting in low-data regimes due to the lack of spatial inductive biases. Such spatial inductive biases can be especially beneficial since the 2D structure of an input image is not well preserved in transformers. In this work, we present Spatial Prior-enhanced Self-Attention (SP-SA), a novel variant of vanilla Self-Attention (SA) tailored for vision transformers. Spatial Priors (SPs) are our proposed family of inductive biases that highlight certain groups of spatial relations. Unlike convolutional inductive biases, which are forced to focus exclusively on hard-coded local regions, our proposed SPs are learned by the model itself and take a variety of spatial relations into account. Specifically, the attention score is calculated with emphasis on certain kinds of spatial relations at each head, and such learned spatial foci can be complementary to each other. Based on SP-SA, we propose the SP-ViT family, which consistently outperforms other ViT models with similar GFlops or parameters. Our largest model SP-ViT-L achieves a record-breaking 86.3% Top-1 accuracy with a reduction in the number of parameters by almost 50% compared to the previous state-of-the-art model (150M for SP-ViT-L vs. 271M for CaiT-M-36) among all ImageNet-1K models trained at 224x224 and fine-tuned at 384x384 resolution without extra data.
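The abstract does not give the exact SP-SA formulation, so the following is only a minimal, hypothetical sketch of the general idea: it assumes the spatial prior can be modeled as a learned per-head bias over relative 2D patch positions that is added to the attention logits, so that each head learns to emphasize its own group of spatial relations. The class names, parameterization, and omission of a class token are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of spatial prior-enhanced self-attention (PyTorch).
# Assumption: the spatial prior is a learned per-head bias over relative
# 2D patch offsets, added to the attention logits before softmax.
# The paper's exact SP-SA formulation may differ; no class token is handled.
import torch
import torch.nn as nn


class SpatialPriorSelfAttention(nn.Module):
    def __init__(self, dim, num_heads, grid_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Learned table of per-head biases over relative 2D offsets
        # (one hypothetical way to parameterize a "spatial prior").
        num_rel = (2 * grid_size - 1) ** 2
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, num_rel))

        # Precompute the relative-position index for every token pair.
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        )).flatten(1)                                   # (2, N)
        rel = coords[:, :, None] - coords[:, None, :]   # (2, N, N)
        rel = rel + grid_size - 1                       # shift offsets to >= 0
        index = rel[0] * (2 * grid_size - 1) + rel[1]   # (N, N)
        self.register_buffer("rel_index", index)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, H, N, d)

        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, H, N, N)
        # Each head adds its own learned spatial focus to the logits,
        # so different heads can emphasize complementary spatial relations.
        attn = attn + self.rel_bias[:, self.rel_index].unsqueeze(0)
        attn = attn.softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: a 14x14 patch grid, e.g. a 224x224 image split into 16x16 patches.
tokens = torch.randn(2, 14 * 14, 384)
sa = SpatialPriorSelfAttention(dim=384, num_heads=6, grid_size=14)
print(sa(tokens).shape)  # torch.Size([2, 196, 384])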
