Paper Title

Why distillation helps: a statistical perspective

Paper Authors

Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

Paper Abstract

Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help? In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques. Our core observation is that the teacher seeks to estimate the underlying (Bayes) class-probability function. Building on this, we establish a fundamental bias-variance tradeoff in the student's objective: this quantifies how approximate knowledge of these class-probabilities can significantly aid learning. Finally, we show how distillation complements existing negative mining techniques for extreme multiclass retrieval, and propose a unified objective which combines these ideas.
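
The following is a minimal sketch, not taken from the paper, of the basic setup the abstract describes: a student trained against one-hot labels versus a student trained against the teacher's class-probability estimates, which the paper views as an approximation to the Bayes class-probability function. The helper names, the mixing weight `alpha`, and the `temperature` parameter are illustrative assumptions, not the authors' exact objective.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_hot_loss(student_logits, labels):
    # Standard empirical risk: cross-entropy against one-hot training labels.
    p = softmax(student_logits)
    n = len(labels)
    return -np.log(p[np.arange(n), labels]).mean()

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    # Distilled risk: the one-hot targets are (partially) replaced by the
    # teacher's softened class-probability estimates, treated here as an
    # approximation of the underlying (Bayes) class probabilities.
    student_log_p = np.log(softmax(student_logits / temperature))
    teacher_p = softmax(teacher_logits / temperature)
    soft_term = -(teacher_p * student_log_p).sum(axis=-1).mean()
    hard_term = one_hot_loss(student_logits, labels)
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Toy usage with random logits for 4 examples and 5 classes.
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(4, 5))
teacher_logits = rng.normal(size=(4, 5))
labels = np.array([0, 2, 1, 4])
print(one_hot_loss(student_logits, labels))
print(distillation_loss(student_logits, teacher_logits, labels))
```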
