Paper Title
Synthesizer Based Efficient Self-Attention for Vision Tasks
Paper Authors
Paper Abstract
The self-attention module shows outstanding competence in capturing long-range relationships while enhancing performance on vision tasks such as image classification and image captioning. However, self-attention relies heavily on dot-product multiplication and dimension alignment among query-key-value features, which causes two problems: (1) the dot-product multiplication results in exhaustive and redundant computation; (2) because the visual feature map often appears as a multi-dimensional tensor, reshaping the tensor to satisfy the dimension alignment may destroy its internal structure. To address these problems, this paper proposes a self-attention plug-in module, together with its variants, named Synthesizing Tensor Transformations (STT), which directly processes image tensor features. Instead of computing the dot product among query, key, and value, the basic STT is composed of tensor transformations that learn synthetic attention weights directly from the visual information. The effectiveness of the STT series is validated on image classification and image captioning. Experiments show that the proposed STT achieves competitive performance while remaining robust compared to self-attention on the aforementioned vision tasks.
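To illustrate the contrast the abstract draws, the sketch below compares standard dot-product self-attention with a synthesizer-style layer that predicts the attention map directly from each token, so no query-key dot product (and no reshaping for query/key dimension alignment) is needed. This is a minimal sketch of the general synthesizer idea the paper builds on, not the paper's actual STT: all module names, shapes, and the dense two-layer synthesizing network are illustrative assumptions, and the tensor-transformation form of STT is not shown.

```python
# Hypothetical sketch: dot-product self-attention vs. synthesizer-style
# attention without query-key dot products. Not the paper's STT code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProductSelfAttention(nn.Module):
    """Standard self-attention: attention comes from query-key dot products."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (batch, n_tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

class DenseSynthesizerAttention(nn.Module):
    """Synthesizer-style attention: the (n x n) weight map is predicted
    from each token on its own, so no dot product between query and key
    features is computed."""
    def __init__(self, dim, n_tokens):
        super().__init__()
        self.synth = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, n_tokens),            # one score per token position
        )
        self.v = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, n_tokens, dim)
        attn = F.softmax(self.synth(x), dim=-1)  # synthesized (n x n) weights
        return attn @ self.v(x)

# Toy usage: a 7x7 feature map flattened to 49 tokens of width 64.
x = torch.randn(2, 49, 64)
print(DotProductSelfAttention(64)(x).shape)      # torch.Size([2, 49, 64])
print(DenseSynthesizerAttention(64, 49)(x).shape)  # torch.Size([2, 49, 64])
```

Note that the synthesized variant trades the input-pair interaction of dot-product attention for a per-token prediction of the full weight map, which is the property the abstract credits with avoiding exhaustive query-key computation.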