Paper Title
Triplet Loss for Knowledge Distillation
Paper Authors
Paper Abstract
In recent years, deep learning has spread rapidly, and deeper and larger models have been proposed. However, the computational cost becomes enormous as models grow larger. Various techniques for compressing model size have been proposed to improve performance while reducing computational cost. One such method is knowledge distillation (KD). Knowledge distillation is a technique for transferring the knowledge of a deep or ensemble model with many parameters (the teacher model) to a smaller, shallower model (the student model). Since the purpose of knowledge distillation is to increase the similarity between the teacher model and the student model, we propose introducing the concept of metric learning into knowledge distillation to bring the student model closer to the teacher model using pairs or triplets of training samples. In metric learning, researchers develop methods for building models whose outputs are more similar for similar samples. Metric learning aims to reduce the distance between the outputs of similar samples and to increase the distance between the outputs of dissimilar samples. This ability of metric learning to reduce the differences between similar outputs can be used in knowledge distillation to reduce the differences between the outputs of the teacher model and the student model. Since the teacher model's outputs for different objects are usually different, the student model needs to distinguish them. We think that metric learning can clarify the differences between these different outputs and that the performance of the student model can thereby be improved. We have performed experiments to compare the proposed method with state-of-the-art knowledge distillation methods.
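To make the idea concrete, the following is a minimal sketch in PyTorch of how a triplet-style distillation loss could be formed; it is an illustration under assumptions, not the authors' implementation. Here the anchor is the student's softened output for a sample, the positive is the teacher's softened output for the same sample, and the negative is the teacher's output for a different sample in the batch (chosen here by simply rolling the batch); the `margin` and `temperature` values are placeholder choices.

```python
# Sketch only (assumed formulation, not the paper's released code): a triplet-style
# loss for knowledge distillation. The student is pulled toward the teacher's output
# for the same sample and pushed away from the teacher's output for another sample.
import torch
import torch.nn.functional as F

def triplet_distillation_loss(student_logits, teacher_logits, margin=1.0, temperature=4.0):
    """student_logits, teacher_logits: tensors of shape (batch, num_classes)."""
    # Soften both outputs with a temperature, as is common in knowledge distillation.
    anchor = F.softmax(student_logits / temperature, dim=1)    # student output (anchor)
    positive = F.softmax(teacher_logits / temperature, dim=1)  # teacher output, same sample
    # Negatives: teacher outputs for other samples, obtained here by rolling the batch.
    negative = positive.roll(shifts=1, dims=0)

    d_pos = F.pairwise_distance(anchor, positive)  # distance to own teacher target
    d_neg = F.pairwise_distance(anchor, negative)  # distance to another sample's target
    # Standard triplet hinge: keep d_pos smaller than d_neg by at least the margin.
    return F.relu(d_pos - d_neg + margin).mean()

# Example usage with random logits for a batch of 8 samples and 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
loss = triplet_distillation_loss(student_logits, teacher_logits)
print(loss.item())
```

In practice such a term would typically be combined with the usual distillation objective (e.g., a softened cross-entropy between student and teacher) and the hard-label loss; the negative-sampling strategy and margin are the main design choices.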