Q-ViT: Fully Differentiable Quantization for Vision Transformer

Paper title

Q-ViT: Fully Differentiable Quantization for Vision Transformer

Authors

Zhexin Li, Tong Yang, Peisong Wang, Jian Cheng

Abstract

In this paper, we propose a fully differentiable quantization method for vision transformer (ViT) named Q-ViT, in which both the quantization scales and the bit-widths are learnable parameters. Specifically, based on our observation that heads in ViT display different quantization robustness, we leverage head-wise bit-widths to squeeze the size of Q-ViT while preserving performance. In addition, we propose a novel technique named switchable scale to resolve the convergence problem in the joint training of quantization scales and bit-widths. In this way, Q-ViT pushes the limits of ViT quantization to 3-bit without heavy performance drop. Moreover, we analyze the quantization robustness of every architecture component of ViT and show that the Multi-head Self-Attention (MSA) and the Gaussian Error Linear Units (GELU) are the key aspects for ViT quantization. This study provides some insights for further research on ViT quantization. Extensive experiments on different ViT models, such as DeiT and Swin Transformer, show the effectiveness of our quantization method. In particular, our method outperforms the state-of-the-art uniform quantization method by 1.5% on DeiT-Tiny.
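The abstract names three mechanisms: learnable quantization scales, learnable head-wise bit-widths, and a "switchable scale". The plain-Python sketch below illustrates one plausible reading of them (forward pass only, no training loop). Several assumptions are made. The quantizer is assumed to be a symmetric, signed uniform quantizer. The scale gradient follows Learned Step Size Quantization (LSQ) with a straight-through estimator, which is swapped in for illustration and not confirmed as the paper's formula. The switchable scale is modelled as a separate scale per candidate bit-width, selected by the head's currently active (rounded) bit-width. All names (`HeadwiseQuantizer`, `SwitchableScale`, `discretize_bits`, and so on) are hypothetical.

```python
import math
from typing import Iterable, List


def round_half_up(v: float) -> int:
    # Deterministic round-half-up (Python's built-in round() uses banker's rounding).
    return math.floor(v + 0.5)


def fake_quantize(x: float, scale: float, bits: int) -> float:
    """Symmetric uniform quantize-then-dequantize on a signed integer grid (assumed form)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = min(qmax, max(qmin, round_half_up(x / scale)))
    return q * scale


def lsq_scale_grad(x: float, scale: float, bits: int) -> float:
    """d(fake_quantize)/d(scale) with a straight-through estimator (LSQ-style, swapped in)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    v = x / scale
    if v <= qmin:
        return float(qmin)
    if v >= qmax:
        return float(qmax)
    return round_half_up(v) - v


def discretize_bits(bit_param: float, lo: int, hi: int) -> int:
    """Map a continuous (learnable) bit-width to an integer bit-width in [lo, hi]."""
    return int(min(hi, max(lo, round_half_up(bit_param))))


class SwitchableScale:
    """Hypothetical: one scale per candidate bit-width; the active bit-width picks its own."""

    def __init__(self, candidate_bits: Iterable[int], max_abs: float):
        # Initialise each scale so max_abs maps to the top of that bit-width's grid.
        # Bit-widths below 2 are skipped: a 1-bit signed grid has qmax = 0.
        self.scales = {b: max_abs / (2 ** (b - 1) - 1) for b in candidate_bits if b >= 2}

    def get(self, bits: int) -> float:
        return self.scales[bits]


class HeadwiseQuantizer:
    """Hypothetical head-wise fake quantizer: each attention head has its own bit-width.

    Assumes candidate_bits is a contiguous range, so every rounded bit-width has a scale.
    """

    def __init__(self, num_heads: int, init_bits: float,
                 candidate_bits: Iterable[int], max_abs: float):
        bits = sorted(b for b in candidate_bits if b >= 2)
        self.lo, self.hi = bits[0], bits[-1]
        # Continuous bit-width per head; it would be learned in training but is fixed here.
        self.bit_params = [float(init_bits)] * num_heads
        self.tables = [SwitchableScale(bits, max_abs) for _ in range(num_heads)]

    def current_bits(self) -> List[int]:
        return [discretize_bits(p, self.lo, self.hi) for p in self.bit_params]

    def quantize(self, heads: List[List[float]]) -> List[List[float]]:
        out = []
        for values, bits, table in zip(heads, self.current_bits(), self.tables):
            scale = table.get(bits)  # switch to the scale kept for this bit-width
            out.append([fake_quantize(v, scale, bits) for v in values])
        return out


# Usage: two heads whose continuous bit-widths have drifted apart.
q = HeadwiseQuantizer(num_heads=2, init_bits=4, candidate_bits=range(2, 9), max_abs=1.0)
q.bit_params = [3.2, 7.6]
print(q.current_bits())  # [3, 8]
print(q.quantize([[1.0, -1.0, 0.5], [0.5]]))
```

Keeping a separate scale per bit-width means that when a head's rounded bit-width changes during training, the head does not inherit a scale tuned for a different grid size. This is presumably the kind of scale/bit-width coupling that the switchable scale is meant to remove, though the paper's exact parameterisation may differ.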