Paper Title

Why distillation helps: a statistical perspective

Paper Authors

Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

Paper Abstract

Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help? In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques. Our core observation is that the teacher seeks to estimate the underlying (Bayes) class-probability function. Building on this, we establish a fundamental bias-variance tradeoff in the student's objective: this quantifies how approximate knowledge of these class-probabilities can significantly aid learning. Finally, we show how distillation complements existing negative mining techniques for extreme multiclass retrieval, and propose a unified objective which combines these ideas.
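
The following is a minimal sketch, not taken from the paper, of the basic setup the abstract describes: a student trained against one-hot labels versus a student trained against the teacher's class-probability estimates, which the paper views as an approximation to the Bayes class-probability function. The helper names, the mixing weight `alpha`, and the `temperature` parameter are illustrative assumptions, not the authors' exact objective.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_hot_loss(student_logits, labels):
    # Standard empirical risk: cross-entropy against one-hot training labels.
    p = softmax(student_logits)
    n = len(labels)
    return -np.log(p[np.arange(n), labels]).mean()

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    # Distilled risk: the one-hot targets are (partially) replaced by the
    # teacher's softened class-probability estimates, treated here as an
    # approximation of the underlying (Bayes) class probabilities.
    student_log_p = np.log(softmax(student_logits / temperature))
    teacher_p = softmax(teacher_logits / temperature)
    soft_term = -(teacher_p * student_log_p).sum(axis=-1).mean()
    hard_term = one_hot_loss(student_logits, labels)
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Toy usage with random logits for 4 examples and 5 classes.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 5))
teacher_logits = rng.normal(size=(4, 5))
labels = np.array([0, 2, 1, 4])
print(one_hot_loss(student_logits, labels))
print(distillation_loss(student_logits, teacher_logits, labels))
```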
