Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural Networks

Paper title

Multiple F0 Estimation in Vocal Ensembles using Convolutional Neural Networks

Authors

Helena Cuesta, Brian McFee, Emilia Gómez

Abstract

This paper addresses the extraction of multiple F0 values from polyphonic and a cappella vocal performances using convolutional neural networks (CNNs). We address the major challenges of ensemble singing, i.e., all melodic sources are vocals and singers sing in harmony. We build upon an existing architecture to produce a pitch salience function of the input signal, where the harmonic constant-Q transform (HCQT) and its associated phase differentials are used as an input representation. The pitch salience function is subsequently thresholded to obtain a multiple F0 estimation output. For training, we build a dataset that comprises several multi-track datasets of vocal quartets with F0 annotations. This work proposes and evaluates a set of CNNs for this task in diverse scenarios and data configurations, including recordings with additional reverb. Our models outperform a state-of-the-art method intended for the same music genre when evaluated with an increased F0 resolution, as well as a general-purpose method for multi-F0 estimation. We conclude with a discussion on future research directions.
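The thresholding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `salience_to_multif0`, the threshold value of 0.5, the toy 20-cent frequency grid, and the local-peak picking along the frequency axis are all assumptions made for illustration. The abstract states only that the salience function is thresholded.

```python
import numpy as np


def salience_to_multif0(salience, freqs, threshold=0.5):
    """Turn a pitch salience matrix into per-frame multi-F0 estimates.

    salience: array of shape (n_bins, n_frames), values assumed in [0, 1]
    freqs:    array of shape (n_bins,), centre frequency of each bin in Hz
    threshold: hypothetical cut-off; the source does not give a value
    """
    salience = np.asarray(salience, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    # Pad with -inf along the frequency axis so edge bins have neighbours.
    padded = np.pad(salience, ((1, 1), (0, 0)), constant_values=-np.inf)
    # Assumption: keep only local maxima in frequency, so that one sung note
    # does not produce several adjacent-bin estimates.
    is_peak = (padded[1:-1] > padded[:-2]) & (padded[1:-1] >= padded[2:])
    keep = is_peak & (salience >= threshold)
    # One array of F0 values (Hz) per frame; an array may be empty.
    return [freqs[keep[:, t]] for t in range(salience.shape[1])]


# Toy input: 6 bins spaced 20 cents apart from 220 Hz, 2 frames.
freqs = 220.0 * 2.0 ** (np.arange(6) / 60.0)
salience = np.array([
    [0.1, 0.0],
    [0.8, 0.2],
    [0.3, 0.1],
    [0.2, 0.6],
    [0.9, 0.7],
    [0.4, 0.3],
])
estimates = salience_to_multif0(salience, freqs, threshold=0.5)
for t, f0s in enumerate(estimates):
    print(t, np.round(f0s, 2))
```

In this toy input the first frame yields two estimates (bins 1 and 4) and the second frame yields one (bin 4). In practice the threshold would be tuned on validation data rather than fixed in advance.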
