Paper Title
Fast accuracy estimation of deep learning based multi-class musical source separation
Paper Authors
Paper Abstract
Music source separation is the task of extracting all the instruments from a given song. Recent breakthroughs on this challenge have gravitated around a single dataset, MUSDB, which is limited to four instrument classes. Scaling to larger datasets and more instruments is costly and time-consuming, both in collecting data and in training deep neural networks (DNNs). In this work, we propose a fast method to evaluate the separability of instruments in any dataset without training and tuning a DNN. This separability measure helps to select appropriate samples for the efficient training of neural networks. Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches such as TasNet or Open-Unmix. Our results contribute to revealing two essential points for audio source separation: 1) the ideal ratio mask, although light and straightforward, provides an accurate measure of the audio separability performance of recent neural nets, and 2) new end-to-end learning methods such as TasNet, which operate directly on waveforms, are in fact internally building a Time-Frequency (TF) representation, so that they encounter the same limitations as TF-based methods when separating audio patterns that overlap in the TF plane.
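As a minimal sketch of the oracle principle the abstract refers to: the ideal ratio mask (IRM) assigns each source, at every time-frequency bin, its share of the total source energy, and applying that mask to the mixture spectrogram gives an upper-bound "oracle" estimate of separability. The function names and the choice of magnitude (rather than power) spectrograms below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ideal_ratio_mask(source_mags, eps=1e-8):
    """Compute one ideal ratio mask per source from magnitude spectrograms.

    source_mags: array of shape (n_sources, freq, time) holding the
    magnitude spectrogram of each isolated source (oracle knowledge,
    available only because the dataset provides ground-truth stems).
    Returns masks of the same shape that sum to ~1 at every TF bin.
    """
    total = source_mags.sum(axis=0, keepdims=True) + eps
    return source_mags / total

def oracle_separate(mix_stft, masks):
    """Apply the IRMs to the mixture STFT to get oracle source estimates.

    mix_stft: complex STFT of the mixture, shape (freq, time).
    masks: output of ideal_ratio_mask, shape (n_sources, freq, time).
    """
    return masks * mix_stft[np.newaxis, ...]
```

Because the masks partition the mixture at every bin, the oracle estimates sum back to the mixture; how well each estimate matches its ground-truth stem (e.g. via SDR) is then the cheap separability proxy, with no DNN training involved.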