Paper Title
Multi-Dimensional Model Compression of Vision Transformer
Paper Authors
Paper Abstract
Vision transformers (ViT) have recently attracted considerable attention, but their huge computational cost remains an issue for practical deployment. Previous ViT pruning methods tend to prune the model along a single dimension only, which may suffer from excessive reduction and lead to sub-optimal model quality. In contrast, we advocate a multi-dimensional ViT compression paradigm, and propose to harness redundancy reduction across the attention head, neuron, and sequence dimensions jointly. We first propose a statistical dependence based pruning criterion that generalizes to different dimensions for identifying deleterious components. Moreover, we cast multi-dimensional compression as an optimization problem, learning the optimal pruning policy across the three dimensions that maximizes the compressed model's accuracy under a computational budget. The problem is solved by our adapted Gaussian process search with expected improvement. Experimental results show that our method effectively reduces the computational cost of various ViT models. For example, our method reduces FLOPs by 40% with no top-1 accuracy loss for DeiT and T2T-ViT models, outperforming the previous state of the art.
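To make the search procedure concrete, below is a minimal sketch of a Gaussian process search with expected improvement over per-dimension pruning ratios under a FLOPs budget, in the spirit of the abstract. It is not the paper's actual implementation: `estimate_flops`, `evaluate_policy`, the candidate grid, and the 60% budget are all hypothetical stand-ins for the paper's budget model and compressed-model evaluation.

```python
# Hedged sketch: Bayesian optimization (GP surrogate + expected improvement)
# over keep-ratios for the (attention head, neuron, sequence) dimensions.
# estimate_flops and evaluate_policy are placeholders, not the paper's code.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
BUDGET = 0.6  # keep at most 60% of dense FLOPs, i.e. a 40% reduction

def estimate_flops(policy):
    # Placeholder fraction of FLOPs kept. A real implementation would count
    # the FLOPs of the ViT with heads/neurons/tokens pruned at these ratios.
    head_keep, neuron_keep, seq_keep = policy
    return seq_keep * (0.5 * head_keep + 0.5 * neuron_keep)

def evaluate_policy(policy):
    # Placeholder objective standing in for the validation accuracy of the
    # pruned (and briefly fine-tuned) model at this pruning policy.
    return float(-np.sum((np.asarray(policy) - 0.7) ** 2))

def expected_improvement(mu, sigma, best, xi=0.01):
    # EI acquisition for maximization; guard against zero predictive std.
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Candidate policies: keep-ratios for (heads, neurons, sequence length),
# pre-filtered so that every candidate satisfies the FLOPs budget.
grid = np.array([(h, n, s)
                 for h in np.linspace(0.3, 1.0, 8)
                 for n in np.linspace(0.3, 1.0, 8)
                 for s in np.linspace(0.3, 1.0, 8)
                 if estimate_flops((h, n, s)) <= BUDGET])

# Warm-start the surrogate with a few random feasible policies.
X = grid[rng.choice(len(grid), size=5, replace=False)]
y = np.array([evaluate_policy(p) for p in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    nxt = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
    X = np.vstack([X, nxt])
    y = np.append(y, evaluate_policy(nxt))

best = X[np.argmax(y)]
print(f"best (head, neuron, seq) keep-ratios: {best}, score: {y.max():.4f}")
```

The design point this illustrates is that the surrogate is fit over the joint three-dimensional policy space rather than each dimension in isolation, so the search can trade head, neuron, and token pruning off against one another within the same budget.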