Title
End-to-End Lyrics Recognition with Self-supervised Learning
Authors
Abstract
Lyrics recognition is an important task in music processing. Although traditional approaches such as the hybrid HMM-TDNN model achieve good performance, studies applying end-to-end models and self-supervised learning (SSL) remain limited. In this paper, we first establish an end-to-end baseline for lyrics recognition and then explore the performance of SSL models on this task. We evaluate a variety of upstream SSL models trained with different objectives (masked reconstruction, masked prediction, autoregressive reconstruction, and contrastive learning). Evaluated on the DAMP music dataset, our end-to-end self-supervised models outperform the previous state-of-the-art (SOTA) system by 5.23% on the dev set and 2.4% on the test set, even without a language model trained on a large corpus. Moreover, we investigate the effect of background music on the performance of self-supervised learning models and conclude that SSL models cannot extract features effectively in the presence of background music. Finally, since these models were not trained on music datasets, we study the out-of-domain generalization ability of the SSL features.
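To make the masked-reconstruction objective mentioned above concrete, here is a minimal, illustrative sketch of the idea: random frames of a feature sequence are hidden, and a predictor is scored by how well it reconstructs them. The function name and the trivial mean-of-visible-frames "predictor" are hypothetical, not the paper's actual training code or any specific SSL model.

```python
import random

def masked_reconstruction_loss(frames, mask_prob=0.15, seed=0):
    """Toy masked-reconstruction objective (illustrative only).

    `frames` is a list of feature vectors (lists of floats). Random
    frames are masked; a trivial predictor guesses each masked frame
    as the per-dimension mean of the visible frames, and the mean
    squared error over masked frames is returned. A real SSL model
    (e.g. one of the upstream models evaluated in the paper) would
    replace this predictor with a learned network.
    """
    rng = random.Random(seed)
    mask = [rng.random() < mask_prob for _ in frames]
    if not any(mask):
        mask[0] = True  # always mask at least one frame
    if all(mask):
        mask[-1] = False  # keep at least one frame visible

    visible = [f for f, m in zip(frames, mask) if not m]
    masked = [f for f, m in zip(frames, mask) if m]
    dim = len(frames[0])

    # Trivial "predictor": per-dimension mean of the visible frames.
    pred = [sum(f[d] for f in visible) / len(visible) for d in range(dim)]

    # Mean squared reconstruction error over the masked frames only.
    return sum((f[d] - pred[d]) ** 2
               for f in masked for d in range(dim)) / (len(masked) * dim)
```

With constant input the masked frames equal the visible mean, so the loss is zero; any variation across frames yields a positive loss, which is what a learned predictor would be trained to reduce.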