Paper Title
TRILLsson: Distilled Universal Paralinguistic Speech Representations
Paper Authors
Paper Abstract
Recent advances in self-supervision have dramatically improved the quality of speech representations. However, deployment of state-of-the-art embedding models on devices has been restricted due to their limited public availability and large resource footprint. Our work addresses these issues by publicly releasing a collection of paralinguistic speech models that are small and achieve near state-of-the-art performance. Our approach is based on knowledge distillation, and our models are distilled on public data only. We explore different architectures and thoroughly evaluate our models on the Non-Semantic Speech (NOSS) benchmark. Our largest distilled model is less than 15% of the size of the original model (314MB vs 2.2GB), achieves over 96% of its accuracy on 6 of 7 tasks, and is trained on 6.5% of the data. The smallest model is 1% of the size (22MB) and achieves over 90% of the accuracy on 6 of 7 tasks. Our models outperform the open-source Wav2Vec 2.0 model on 6 of 7 tasks, and our smallest model outperforms open-source Wav2Vec 2.0 on both emotion recognition tasks despite being 7% of its size.
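The abstract describes distilling a large embedding model into a much smaller student on public data. A common way to do this (not necessarily the paper's exact recipe) is to have the student regress the teacher's embeddings with a mean-squared-error loss. Below is a minimal, self-contained sketch in that spirit; the "teacher" here is a hypothetical fixed nonlinear projection standing in for the real 2.2GB model, and all dimensions and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the large teacher network: a frozen
# nonlinear projection from 16-d "audio features" to 8-d embeddings.
W_teacher = rng.normal(size=(16, 8))

def teacher_embed(x):
    return np.tanh(x @ W_teacher)

# Student: a much smaller linear map, trained to match the teacher's
# embeddings -- the core idea of distillation by embedding regression.
W_student = np.zeros((16, 8))

X = rng.normal(size=(256, 16))   # unlabeled inputs (no task labels needed)
Y = teacher_embed(X)             # teacher embeddings as regression targets

losses = []
lr = 0.01
for _ in range(200):
    pred = X @ W_student
    err = pred - Y
    losses.append(float(np.mean(err ** 2)))
    grad = X.T @ err / len(X)    # gradient of the MSE w.r.t. W_student
    W_student -= lr * grad

print(f"distillation MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because only the teacher's outputs are needed, this setup runs on any unlabeled public dataset, which is why the student can be trained on far less (and differently sourced) data than the teacher.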