Paper Title
Masked Autoencoders Enable Efficient Knowledge Distillers
Paper Authors
Paper Abstract
This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given that 1) only a small visible subset of patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., forward propagating inputs through the first few layers, to obtain intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from teacher models even with extremely high masking ratios: e.g., with a 95% masking ratio, where merely ten patches are visible during distillation, our ViT-B competitively attains a top-1 ImageNet accuracy of 83.6%; surprisingly, it can still secure 82.4% top-1 ImageNet accuracy by aggressively training with just four visible patches (98% masking ratio). The code and models are publicly available at https://github.com/UCSC-VLAA/DMAE.
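The abstract describes a two-term objective: a pixel reconstruction loss computed on masked patches, plus an L2 distance between the teacher's and student's intermediate feature maps. The following is a minimal NumPy sketch of that combined loss; the function name, tensor shapes, and the weighting factor `alpha` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def dmae_loss(student_feat, teacher_feat, pred_pixels, target_pixels, mask, alpha=1.0):
    """Hypothetical sketch of the objective described in the abstract.

    student_feat, teacher_feat: intermediate feature maps, shape (N, num_patches, dim).
    pred_pixels, target_pixels: per-patch pixel values, shape (N, num_patches, patch_dim).
    mask: 1 for masked patches (reconstruction targets), 0 for visible, shape (N, num_patches).
    alpha: assumed weighting between the two terms.
    """
    # Reconstruction term: mean squared error, averaged only over masked patches
    # (as in MAE, the loss is computed on the masked positions).
    per_patch_mse = ((pred_pixels - target_pixels) ** 2).mean(axis=-1)  # (N, num_patches)
    recon_loss = (per_patch_mse * mask).sum() / mask.sum()

    # Distillation term: L2 distance between student and teacher features.
    # The teacher only needs its first few layers run to produce these,
    # which is the source of the computational savings claimed above.
    distill_loss = ((student_feat - teacher_feat) ** 2).mean()

    return recon_loss + alpha * distill_loss
```

Because the teacher stops after its early layers and both models see only the small visible subset of patches, both terms are cheap to evaluate even for a large teacher.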