Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Paper Title

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Authors

Xiangcheng Liu, Tianyi Wu, Guodong Guo

Abstract

Vision Transformers (ViTs) have emerged as a new paradigm in computer vision, showing excellent performance at the price of expensive computation. Image token pruning is one of the main approaches to ViT compression, because complexity is quadratic in the number of tokens and many tokens that contain only background regions do not truly contribute to the final prediction. Existing works either rely on additional modules to score the importance of individual tokens or apply a fixed-ratio pruning strategy to all input instances. In this work, we propose an adaptive sparse token pruning framework with minimal cost. Specifically, we first propose an inexpensive class attention scoring mechanism weighted by attention head importance. Then, learnable parameters are inserted as thresholds to distinguish informative tokens from unimportant ones. By comparing token attention scores against the thresholds, we can discard useless tokens hierarchically and thus accelerate inference. The learnable thresholds are optimized with budget-aware training to balance accuracy and complexity, yielding a pruning configuration adapted to each input instance. Extensive experiments demonstrate the effectiveness of our approach. Our method improves the throughput of DeiT-S by 50% with only a 0.2% drop in top-1 accuracy, achieving a better trade-off between accuracy and latency than previous methods.
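
The scoring-and-thresholding step described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy attention values, and the threshold are all hypothetical. The abstract does not say how head importance is derived, so the sketch takes it as a given input and normalizes it to sum to one. It covers only the inference-time comparison, not the budget-aware training that learns the thresholds.

```python
from typing import List, Sequence


def weighted_class_attention_scores(
    cls_attention: Sequence[Sequence[float]],
    head_importance: Sequence[float],
) -> List[float]:
    """Combine per-head class-token attention into one score per image token.

    cls_attention[h][n] is the attention the class token pays to image token n
    in head h. head_importance[h] is a non-negative weight for head h.
    Assumption: the weights are supplied by the caller, since the abstract
    does not describe how they are computed.
    """
    total = sum(head_importance)
    if total <= 0:
        raise ValueError("head_importance must have a positive sum")
    num_tokens = len(cls_attention[0])
    scores = [0.0] * num_tokens
    for head_row, weight in zip(cls_attention, head_importance):
        for n, attn in enumerate(head_row):
            # Each head contributes in proportion to its normalized importance.
            scores[n] += (weight / total) * attn
    return scores


def prune_tokens(scores: Sequence[float], threshold: float) -> List[int]:
    """Return the indices of tokens whose score reaches the threshold."""
    return [n for n, s in enumerate(scores) if s >= threshold]


# Toy example (hypothetical values): 2 heads, 4 image tokens.
cls_attention = [
    [0.50, 0.30, 0.15, 0.05],  # head 0
    [0.10, 0.60, 0.20, 0.10],  # head 1
]
head_importance = [3.0, 1.0]  # head 0 counts three times as much as head 1
scores = weighted_class_attention_scores(cls_attention, head_importance)
kept = prune_tokens(scores, threshold=0.2)
print(kept)  # indices of the tokens that survive pruning
```

Because the comparison is made per input, the number of surviving tokens varies with the attention pattern of each image, which is what separates this scheme from a fixed-ratio strategy. In a hierarchical setup, a separate threshold would be applied at each of several layers, so later layers process progressively fewer tokens.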
