Paper Title
Autoregressive Knowledge Distillation through Imitation Learning
Paper Authors
Paper Abstract
The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of slower inference, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain BLEU/ROUGE scores 1.4 to 4.8 points higher than models trained from scratch, while increasing inference speed by up to 14 times compared to the teacher model.
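To make the abstract's core idea concrete, the sketch below shows one plausible instantiation of distillation from an imitation-learning viewpoint, not necessarily the paper's exact algorithm: the student decodes prefixes by sampling from its own distribution (so the states seen during training match those seen at inference, which is how exposure bias is mitigated), and the teacher supplies the soft next-token target at every position. The tiny models, sizes, and hyperparameters (`TinyLM`, `dim`, `MAX_LEN`, etc.) are made up for illustration; the teacher here is untrained and stands in for a large pretrained model.

```python
# Minimal sketch of imitation-learning-style knowledge distillation for an
# autoregressive model. Assumptions: a DAgger-like setup where the student
# rolls out its own prefixes and matches the teacher's per-step distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, BOS, MAX_LEN = 1000, 1, 20


class TinyLM(nn.Module):
    """Tiny autoregressive model: embedding -> GRU -> vocabulary logits."""

    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):
        # tokens: (batch, seq_len) -> logits: (batch, seq_len, vocab)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)


@torch.no_grad()
def sample_prefix(model, batch_size, length, device="cpu"):
    """Roll out a prefix by sampling from the model's own predictions."""
    tokens = torch.full((batch_size, 1), BOS, dtype=torch.long, device=device)
    for _ in range(length):
        logits = model(tokens)[:, -1, :]                   # next-token logits
        nxt = torch.multinomial(F.softmax(logits, -1), 1)  # sample, not argmax
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens


def distill_step(student, teacher, optimizer, batch_size=8, device="cpu"):
    """One update: the student imitates the teacher on its own sampled prefixes."""
    prefixes = sample_prefix(student, batch_size, MAX_LEN, device)
    student_logits = student(prefixes)          # (batch, time, vocab)
    with torch.no_grad():
        teacher_logits = teacher(prefixes)      # soft targets on the same states
    # KL(teacher || student) at every time step of the student-generated prefix.
    loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    teacher, student = TinyLM(dim=128), TinyLM(dim=32)   # student is smaller
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    for step in range(3):
        print("loss:", distill_step(student, teacher, opt))
```

By contrast, sequence-level knowledge distillation (the baseline named in the abstract) trains the student with ordinary cross-entropy on sequences decoded by the teacher; the sketch above differs in that the teacher is queried on the student's own rollouts, which is the exposure-bias-oriented ingredient the abstract highlights.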