Paper Title

Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning

Paper Authors

Bingbing Li, Zhenglun Kong, Tianyun Zhang, Ji Li, Zhengang Li, Hang Liu, Caiwen Ding

Paper Abstract

Pre-trained large-scale language models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. However, the limited weight storage and computational speed on hardware platforms have impeded the popularity of pre-trained models, especially in the era of edge computing. In this work, we propose an efficient transformer-based large-scale language representation using hardware-friendly block structured pruning. We incorporate the reweighted group Lasso into block-structured pruning for optimization. Besides the significantly reduced weight storage and computation, the proposed approach achieves high compression rates. Experimental results on different models (BERT, RoBERTa, and DistilBERT) on the General Language Understanding Evaluation (GLUE) benchmark tasks show that we achieve up to a 5.0x compression rate with zero or minor accuracy degradation on certain task(s). Our proposed method is also orthogonal to existing compact pre-trained language models such as DistilBERT using knowledge distillation, since a further 1.79x average compression rate can be achieved on top of DistilBERT with zero or minor accuracy degradation. This makes the final compressed model suitable for deployment on resource-constrained edge devices.
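To make the core idea concrete, below is a minimal, hypothetical sketch (in PyTorch) of how block-structured pruning acts on a single weight matrix: the matrix is tiled into fixed-size blocks, each block is scored by its L2 norm (the per-group term of a group-Lasso penalty), and whole low-scoring blocks are zeroed out. The block size, keep ratio, and magnitude-based thresholding here are illustrative assumptions; the paper instead drives blocks toward zero during training with a reweighted group Lasso regularizer, and this is not the authors' implementation.

```python
# A minimal, illustrative sketch of block-structured pruning with a group-Lasso-style
# block score. Block sizes, keep ratio, and threshold rule are assumptions for illustration.
import torch

def block_prune(weight: torch.Tensor, block_rows: int = 16, block_cols: int = 16,
                keep_ratio: float = 0.2) -> torch.Tensor:
    """Zero out entire blocks of `weight` whose L2 norm falls outside the
    top `keep_ratio` fraction of blocks."""
    rows, cols = weight.shape
    # Pad so the matrix tiles evenly into (block_rows x block_cols) blocks.
    pad_r = (-rows) % block_rows
    pad_c = (-cols) % block_cols
    padded = torch.nn.functional.pad(weight, (0, pad_c, 0, pad_r))
    R, C = padded.shape
    # View the matrix as a grid of blocks and score each block by its L2 norm.
    blocks = padded.reshape(R // block_rows, block_rows, C // block_cols, block_cols)
    scores = blocks.pow(2).sum(dim=(1, 3)).sqrt()            # one score per block
    # Keep only the top-scoring fraction of blocks; zero the rest as whole tiles.
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    mask = (scores >= threshold).float()[:, None, :, None]   # broadcast over each block
    pruned = (blocks * mask).reshape(R, C)[:rows, :cols]      # undo padding
    return pruned

# Example: prune one BERT-sized projection matrix (hypothetical usage).
w = torch.randn(768, 768)
w_pruned = block_prune(w, block_rows=16, block_cols=16, keep_ratio=0.2)
print(f"Sparsity: {(w_pruned == 0).float().mean().item():.2%}")
```

Because each block is kept or dropped as a unit, the surviving weights form dense tiles with a regular index pattern, which is what makes this sparsity hardware-friendly compared with irregular element-wise pruning.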
