Paper Title
MoCoViT: Mobile Convolutional Vision Transformer
Paper Authors
Paper Abstract
Recently, Transformer networks have achieved impressive results on a variety of vision tasks. However, most of them are computationally expensive and unsuitable for real-world mobile applications. In this work, we present the Mobile Convolutional Vision Transformer (MoCoViT), which improves performance and efficiency by introducing transformers into mobile convolutional networks, leveraging the benefits of both architectures. Unlike recent work on vision transformers, the mobile transformer block in MoCoViT is carefully designed for mobile devices and is very lightweight, accomplished through two primary modifications: the Mobile Self-Attention (MoSA) module and the Mobile Feed-Forward Network (MoFFN). MoSA simplifies the calculation of the attention map through a branch-sharing scheme, while MoFFN serves as a mobile version of the MLP in the transformer, further reducing computation by a large margin. Comprehensive experiments verify that the proposed MoCoViT family outperforms state-of-the-art portable CNN and transformer architectures on various vision tasks. On ImageNet classification, it achieves 74.5% top-1 accuracy at 147M FLOPs, a 1.2% gain over MobileNetV3 with less computation. On the COCO object detection task, MoCoViT outperforms GhostNet by 2.1 AP in the RetinaNet framework.
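To make the two modules named in the abstract concrete, below is a minimal PyTorch sketch. It is an illustration under stated assumptions, not the paper's implementation: "branch sharing" is modeled here by reusing a single projection as query, key, and value, and MoFFN is modeled as pointwise plus depthwise 1-D convolutions in place of the dense MLP. The class names MoSA and MoFFN follow the abstract, but their internals here are hypothetical.

```python
# Hedged sketch of the two blocks described in the abstract.
# Assumptions (not from the paper): branch sharing = one projection reused as
# Q, K, and V; MoFFN = pointwise + depthwise convs replacing the dense MLP.
import torch
import torch.nn as nn


class MoSA(nn.Module):
    """Mobile Self-Attention: a single shared branch replaces separate Q/K/V."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.shared = nn.Linear(dim, dim)  # shared Q/K/V branch (assumption)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, C)
        B, N, C = x.shape
        h = self.shared(x)  # one projection, reused three times below
        h = h.reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)
        attn = (h @ h.transpose(-2, -1)) * self.scale  # Q and K are the same branch
        attn = attn.softmax(dim=-1)
        out = (attn @ h).transpose(1, 2).reshape(B, N, C)  # V is also the same branch
        return self.proj(out)


class MoFFN(nn.Module):
    """Mobile FFN: pointwise + depthwise convolutions instead of a dense MLP."""

    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, 1),                               # pointwise expand
            nn.Conv1d(hidden, hidden, 3, padding=1, groups=hidden),  # depthwise mix
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, dim, 1),                               # pointwise project
        )

    def forward(self, x):  # x: (B, N, C)
        return self.net(x.transpose(1, 2)).transpose(1, 2)


# Usage: a 14x14 token grid with 64 channels keeps its shape through both blocks.
x = torch.randn(2, 196, 64)
y = MoFFN(64)(MoSA(64)(x))  # y.shape == (2, 196, 64)
```

The sharing of one projection across the query, key, and value branches is what removes parameters and FLOPs relative to standard self-attention; the depthwise-separable MoFFN applies the same cost-cutting idea to the MLP, which is consistent with the mobile-efficiency claims in the abstract.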