Paper Title
Search for Better Students to Learn Distilled Knowledge
Paper Authors
Paper Abstract
Knowledge Distillation, as a model compression technique, has received great attention. The knowledge of a well-performing teacher is distilled to a student with a small architecture. The architecture of the small student is often chosen to be similar to the teacher's, with fewer layers, fewer channels, or both. However, even with the same number of FLOPs or parameters, students with different architectures can achieve different generalization abilities. Configuring a student architecture requires intensive network architecture engineering. In this work, instead of designing a good student architecture manually, we propose to search for the optimal student automatically. Based on L1-norm optimization, a subgraph of the teacher network topology graph is selected as the student, with the goal of minimizing the KL-divergence between the student's and the teacher's outputs. We verify the proposal on the CIFAR10 and CIFAR100 datasets. Empirical experiments show that the learned student architectures achieve better performance than those specified manually. We also visualize and analyze the architectures of the found students.
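To make the abstract's objective concrete, the following is a minimal sketch of one way such a loss could look: a KL-divergence term between softened student and teacher outputs plus an L1-norm penalty on learnable per-channel gates, so that channels with near-zero gates could be pruned to carve a student subgraph out of the teacher. This is an illustrative assumption, not the paper's exact formulation; the function name, the `gates` tensor, and the hyperparameters (`temperature`, `l1_weight`) are hypothetical.

```python
import torch
import torch.nn.functional as F


def distill_with_l1_gates(student_logits, teacher_logits, gates,
                          temperature=4.0, l1_weight=1e-4):
    """Sketch: KL distillation loss plus an L1 sparsity penalty on channel gates.

    `gates` is assumed to be a 1-D tensor of learnable per-channel scaling
    factors inserted into the teacher graph; names and values are illustrative.
    """
    # Soften both output distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)

    # KL-divergence between the student's and the teacher's outputs
    # (batch mean), scaled by T^2 as is common in distillation.
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    # L1-norm penalty encourages sparse gates, i.e. a small student subgraph.
    l1 = l1_weight * gates.abs().sum()
    return kl + l1
```

Under this reading, channels whose gates are driven toward zero by the L1 term would be removed after training, yielding the searched student architecture.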