Paper Title
Music FaderNets: Controllable Music Generation Based On High-Level Features via Low-Level Feature Modelling
Authors
Abstract
High-level musical qualities (such as emotion) are often abstract, subjective, and hard to quantify. Given these difficulties, it is not easy to learn good feature representations with supervised learning techniques, either because of the insufficiency of labels, or the subjectiveness (and hence large variance) in human-annotated labels. In this paper, we present a framework that can learn high-level feature representations with a limited amount of data, by first modelling their corresponding quantifiable low-level attributes. We refer to our proposed framework as Music FaderNets, which is inspired by the fact that low-level attributes can be continuously manipulated by separate "sliding faders" through feature disentanglement and latent regularization techniques. High-level features are then inferred from the low-level representations through semi-supervised clustering using Gaussian Mixture Variational Autoencoders (GM-VAEs). Using arousal as an example of a high-level feature, we show that the "faders" of our model are disentangled and change linearly w.r.t. the modelled low-level attributes of the generated output music. Furthermore, we demonstrate that the model successfully learns the intrinsic relationship between arousal and its corresponding low-level attributes (rhythm and note density), with only 1% of the training set being labelled. Finally, using the learnt high-level feature representations, we explore the application of our framework in style transfer tasks across different arousal states. The effectiveness of this approach is verified through a subjective listening test.
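To make the notion of a quantifiable low-level attribute and a "sliding fader" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes note density can be approximated as the number of note onsets per bar, and that a "fader" amounts to shifting a single latent dimension while leaving the others fixed. The function names `note_density` and `slide_fader`, the onset representation in beats, and the 4-beat bar are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def note_density(onsets_in_beats: Sequence[float], beats_per_bar: int = 4) -> List[int]:
    """Count note onsets per bar (an assumed, simplified definition of note density)."""
    if not onsets_in_beats:
        return []
    # The bar index of the latest onset determines how many bars are needed.
    n_bars = int(max(onsets_in_beats) // beats_per_bar) + 1
    counts = [0] * n_bars
    for onset in onsets_in_beats:
        counts[int(onset // beats_per_bar)] += 1
    return counts


def slide_fader(z: Sequence[float], dim: int, amount: float) -> List[float]:
    """Return a copy of latent vector z with one dimension shifted by `amount`.

    Hypothetical illustration: in a disentangled, regularized latent space,
    moving one dimension would ideally change only the attribute tied to it.
    """
    shifted = list(z)
    shifted[dim] += amount
    return shifted


if __name__ == "__main__":
    onsets = [0.0, 1.0, 2.5, 4.0, 7.5]  # note onsets measured in beats
    print(note_density(onsets))          # onsets counted per 4-beat bar
    print(slide_fader([0.0, 1.0], dim=1, amount=0.5))
```

In a trained model, the shifted latent vector would be passed to a decoder to generate music with a correspondingly higher or lower attribute value; the decoder is omitted here because the abstract does not specify it.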