Paper Title
Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation
Paper Authors
Paper Abstract
Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depthwise separable (DWS) convolutions, a lightweight and faster variant of typical convolutions. We focus on singing voice separation, employing an RNN architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study on the effect of the number of channels and layers of the DWS-CNNs on source separation performance, using the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that replacing RNNs with DWS-CNNs yields improvements of 1.20, 0.06, and 0.37 dB in these metrics, respectively, while using only 20.57% of the parameters of the RNN architecture.
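As a concrete illustration of the DWS idea mentioned in the abstract, the sketch below implements a generic depthwise separable 1-D convolution block in PyTorch: a depthwise convolution (one filter per input channel, obtained with groups=in_channels) followed by a 1x1 pointwise convolution that mixes information across channels. The class name, kernel size, and tensor shapes here are illustrative assumptions for this note, not the exact configuration used in the paper.

import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    """Generic depthwise separable convolution block (illustrative sketch).

    A standard convolution is split into two cheaper steps:
    1) a depthwise convolution that filters each channel independently, and
    2) a 1x1 pointwise convolution that combines the channels.
    """

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 5):
        super().__init__()
        # Depthwise step: groups == in_channels, so each channel gets its own filter.
        self.depthwise = nn.Conv1d(
            in_channels, in_channels, kernel_size,
            padding=kernel_size // 2, groups=in_channels,
        )
        # Pointwise step: 1x1 convolution mixes channels into out_channels.
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


if __name__ == "__main__":
    # Hypothetical input: batch of 8 spectrogram-like sequences with
    # 1025 frequency channels and 100 time frames (shapes chosen for illustration).
    x = torch.randn(8, 1025, 100)
    block = DepthwiseSeparableConv1d(in_channels=1025, out_channels=512)
    print(block(x).shape)  # torch.Size([8, 512, 100])

Compared with a standard convolution of the same kernel size and channel counts, this factorization uses far fewer parameters, which is the source of the parameter savings the abstract reports.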