Paper Title
Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application
Paper Authors
Paper Abstract
Self-attention mechanisms, especially multi-head self-attention (MSA), have achieved great success in many fields such as computer vision and natural language processing. However, many existing vision transformer (ViT) works simply inherit transformer designs from NLP and adapt them to vision tasks, while ignoring the fundamental difference in how MSA works in image versus language settings. Language naturally contains highly semantic structures that are directly interpretable by humans. Its basic unit (the word) is discrete and carries little redundant information, which readily supports interpretable studies of the MSA mechanisms in language transformers. In contrast, visual data exhibit a fundamentally different structure: their basic unit (the pixel) is a natural low-level representation with significant redundancy in its neighbourhood, which poses obvious challenges to the interpretability of the MSA mechanism in ViT. In this paper, we introduce a classical image processing technique, the scale-invariant feature transform (SIFT), which maps low-level representations into a mid-level space and annotates a set of discrete keypoints with semantically rich information. Next, we construct a weighted patch interrelation analysis based on SIFT keypoints to capture the attention patterns hidden in patches with different semantic concentrations. Interestingly, we find this quantitative analysis is not only an effective complement to the interpretability of MSA mechanisms in ViT, but can also be applied to 1) spurious correlation discovery and ``prompting'' during model inference, and 2) guided acceleration of model pre-training. Experimental results on both applications show significant advantages over baselines, demonstrating the efficacy of our method.
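
To make the keypoint-based patch analysis concrete, below is a minimal sketch of how SIFT keypoint density per ViT patch could serve as a "semantic concentration" weight. This is an illustrative assumption, not the paper's implementation: it relies on OpenCV's SIFT detector, assumes a standard 224x224 input with 16x16 patches, and the function name `patch_keypoint_weights` is hypothetical.

```python
# Illustrative sketch (not the authors' code): weight ViT patches by the
# density of SIFT keypoints they contain, as a proxy for semantic concentration.
import cv2
import numpy as np

def patch_keypoint_weights(image_path, image_size=224, patch_size=16):
    """Count SIFT keypoints falling inside each ViT patch and normalize to weights."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (image_size, image_size))

    sift = cv2.SIFT_create()            # requires OpenCV >= 4.4
    keypoints = sift.detect(img, None)  # discrete, mid-level keypoints

    grid = image_size // patch_size     # e.g., 14 x 14 patch grid
    counts = np.zeros((grid, grid), dtype=np.float32)
    for kp in keypoints:
        x, y = kp.pt                    # keypoint location in pixel coordinates
        row = min(int(y) // patch_size, grid - 1)
        col = min(int(x) // patch_size, grid - 1)
        counts[row, col] += 1

    # Patches containing more keypoints are treated as more semantically concentrated.
    weights = counts / max(counts.sum(), 1.0)
    return weights.flatten()            # one weight per patch token
```

Under this sketch, the per-patch weights could then be used to aggregate a ViT's attention maps, for example by weighting each patch's attention row before averaging, to probe how much attention flows toward keypoint-dense regions.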