Paper Title
TutorNet: Towards Flexible Knowledge Distillation for End-to-End Speech Recognition
Paper Authors
Paper Abstract
In recent years, there has been a great deal of research into developing end-to-end speech recognition models, which simplify the traditional pipeline and achieve promising results. Despite their remarkable performance improvements, end-to-end models typically incur a high computational cost to reach that level of performance. To reduce this computational burden, knowledge distillation (KD), a popular model compression method, has been used to transfer knowledge from a deep and complex model (the teacher) to a shallower and simpler model (the student). Previous KD approaches have commonly designed the architecture of the student model by reducing the width per layer or the number of layers of the teacher model. This structural reduction scheme might limit the flexibility of model selection, since the student's structure should remain similar to that of the given teacher. To cope with this limitation, we propose a new KD method for end-to-end speech recognition, namely TutorNet, which can transfer knowledge across different types of neural networks at the hidden-representation level as well as the output level. As a concrete realization, we first apply representation-level knowledge distillation (RKD) during the initialization step, and then apply softmax-level knowledge distillation (SKD) combined with the original task learning. When the student is trained with RKD, we make use of frame weighting, which points out the frames to which the teacher model pays more attention. Through a number of experiments on the LibriSpeech dataset, it is verified that the proposed method not only distills knowledge between networks with different topologies but also significantly contributes to improving the word error rate (WER) of the distilled student. Interestingly, TutorNet allows the student model to surpass its teacher's performance in some particular cases.
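The abstract describes a two-stage recipe: frame-weighted representation-level distillation for initialization, followed by softmax-level distillation combined with the original task loss. The following is a minimal PyTorch sketch of how such losses might look; the tensor shapes, the linear projection between mismatched hidden sizes, the L2/KL loss choices, and the frame-weighting heuristic are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: shapes, the projection layer, and the loss choices
# below are assumptions, not TutorNet's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rkd_loss(student_hidden, teacher_hidden, frame_weights, projection):
    """Representation-level KD (stage 1, used at initialization).
    student_hidden: (B, T, D_s), teacher_hidden: (B, T, D_t),
    frame_weights: (B, T) relative importance of each frame to the teacher,
    projection: nn.Linear(D_s, D_t) bridging mismatched hidden sizes."""
    per_frame = ((projection(student_hidden) - teacher_hidden) ** 2).mean(dim=-1)  # (B, T)
    return (frame_weights * per_frame).sum() / frame_weights.sum()

def skd_loss(student_logits, teacher_logits, temperature=2.0):
    """Softmax-level KD (stage 2): frame-wise KL divergence between teacher and
    student output distributions, one common choice for the distillation term."""
    vocab = student_logits.size(-1)
    log_p_s = F.log_softmax(student_logits.reshape(-1, vocab) / temperature, dim=-1)
    p_t = F.softmax(teacher_logits.reshape(-1, vocab) / temperature, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

def stage2_loss(task_loss_value, student_logits, teacher_logits, lam=0.5):
    """Stage 2 objective: original task loss (e.g. CTC) plus the weighted SKD term."""
    return task_loss_value + lam * skd_loss(student_logits, teacher_logits)
```

In this sketch, stage 1 would minimize only `rkd_loss` to initialize the student, and stage 2 would minimize `stage2_loss`. Because the learnable projection bridges arbitrary hidden dimensionalities, the student's topology need not mirror the teacher's, which is consistent with the flexibility the abstract claims.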