Paper Title
Multimodal Token Fusion for Vision Transformers
Paper Authors
Paper Abstract
Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve performance, yet the intra-modal attentive weights may also be diluted, which could thus undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point clouds and images. Our code is available at https://github.com/yikaiw/TokenFusion.
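To make the fusion mechanism described in the abstract more concrete, below is a minimal PyTorch sketch of the token-substitution idea: each modality predicts a per-token informativeness score, and tokens scoring below a threshold are replaced by projected features from the other modality. This is an illustrative sketch, not the authors' implementation (see the repository linked above); the class name `TokenFusionSketch`, the scoring MLPs, and the threshold value are assumptions, and residual positional alignment is omitted for brevity.

```python
import torch
import torch.nn as nn

class TokenFusionSketch(nn.Module):
    """Illustrative two-modality token fusion layer (hypothetical names)."""
    def __init__(self, dim: int, threshold: float = 0.02):
        super().__init__()
        # Per-token "informativeness" score predictors, one per modality.
        self.score_a = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                     nn.Linear(dim // 4, 1), nn.Sigmoid())
        self.score_b = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                     nn.Linear(dim // 4, 1), nn.Sigmoid())
        # Cross-modal projections used to fill in substituted tokens.
        self.proj_b_to_a = nn.Linear(dim, dim)
        self.proj_a_to_b = nn.Linear(dim, dim)
        self.threshold = threshold  # illustrative cutoff for "uninformative"

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor):
        # tokens_a, tokens_b: (batch, num_tokens, dim), assumed spatially aligned
        # so that token i of modality A corresponds to token i of modality B.
        s_a = self.score_a(tokens_a)             # (B, N, 1)
        s_b = self.score_b(tokens_b)             # (B, N, 1)
        mask_a = (s_a < self.threshold).float()  # 1 where an A-token is uninformative
        mask_b = (s_b < self.threshold).float()
        # Replace uninformative tokens with features projected from the other modality;
        # informative tokens pass through unchanged, keeping the backbone intact.
        fused_a = tokens_a * (1 - mask_a) + self.proj_b_to_a(tokens_b) * mask_a
        fused_b = tokens_b * (1 - mask_b) + self.proj_a_to_b(tokens_a) * mask_b
        return fused_a, fused_b

if __name__ == "__main__":
    layer = TokenFusionSketch(dim=256)
    a = torch.randn(2, 196, 256)   # e.g. RGB tokens
    b = torch.randn(2, 196, 256)   # e.g. depth tokens
    fa, fb = layer(a, b)
    print(fa.shape, fb.shape)      # torch.Size([2, 196, 256]) each
```

In a full model, a layer like this would sit between transformer blocks of the two single-modal branches, so each branch keeps its own weights while uninformative tokens borrow aligned features from the other modality.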