Paper Title


Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation

Paper Authors

Bowen Wu, Huan Zhang, Mengyuan Li, Zongsheng Wang, Qihang Feng, Junhong Huang, Baoxun Wang

Paper Abstract


Recently, BERT has become an essential ingredient of various NLP deep models due to its effectiveness and universal usability. However, the online deployment of BERT is often blocked by its large-scale parameters and high computational cost. Plenty of studies have shown that knowledge distillation is effective in transferring knowledge from BERT into models with far fewer parameters. Nevertheless, current BERT distillation approaches mainly focus on task-specific distillation; such methodologies lose the general semantic knowledge of BERT and thus its universal usability. In this paper, we propose a sentence-representation-approximation oriented distillation framework that can distill the pre-trained BERT into a simple LSTM-based model without specifying tasks. Consistent with BERT, our distilled model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task. Moreover, our model can further cooperate with task-specific distillation procedures. Experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods and even much larger models, such as ELMo, with well-improved efficiency.
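To make the idea of non-task-specific distillation concrete, below is a minimal sketch of how a small LSTM student could be trained to approximate BERT's sentence representation on unlabeled text, before any task-specific fine-tuning. This is not the authors' exact architecture or objective: the `LSTMStudent` module, the mean pooling, the choice of the [CLS] hidden state as the teacher target, and the MSE loss are all assumptions made for illustration.

```python
# Illustrative sketch only: a BiLSTM student learns to approximate BERT's
# sentence representation with an MSE objective (model names, dimensions,
# pooling, and loss are assumptions, not the paper's exact method).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen teacher: pre-trained BERT providing sentence-level representations.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
teacher = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

class LSTMStudent(nn.Module):
    """Small BiLSTM encoder whose pooled output is trained to match BERT's."""
    def __init__(self, vocab_size, emb_dim=256, hidden=384, bert_dim=768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, bert_dim)  # map into BERT's space

    def forward(self, input_ids, attention_mask):
        h, _ = self.lstm(self.emb(input_ids))
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (h * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # masked mean pooling
        return self.proj(pooled)

student = LSTMStudent(vocab_size=tokenizer.vocab_size).to(device)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()

def distill_step(sentences):
    """One non-task-specific distillation step on a batch of raw sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt").to(device)
    with torch.no_grad():
        # Take the [CLS] hidden state as the teacher's sentence representation.
        target = teacher(**batch).last_hidden_state[:, 0]
    pred = student(batch["input_ids"], batch["attention_mask"])
    loss = mse(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(distill_step(["An unlabeled sentence.", "Another sentence from a generic corpus."]))
```

Because the distillation target is a generic sentence representation rather than task logits, the resulting student could, in principle, be fine-tuned on any sentence-level downstream task, mirroring how the distilled model in the paper is used.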
