Paper Title
Music Gesture for Visual Sound Separation
Paper Authors
Paper Abstract
Recent deep learning approaches have achieved impressive performance on visual sound separation tasks. However, these approaches are mostly built on appearance features and optical-flow-based motion representations, which have limited ability to capture correlations between audio signals and visual cues, especially when separating multiple instruments of the same type, such as multiple violins in a scene. To address this, we propose "Music Gesture," a keypoint-based structured representation that explicitly models the body and finger movements of musicians as they perform music. We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals. Experimental results on three music performance datasets show: 1) strong improvements over benchmark metrics on hetero-musical separation tasks (i.e., different instruments); 2) a new capability for effective homo-musical separation of piano, flute, and trumpet duets, which, to the best of our knowledge, has not been achieved by alternative methods. Project page: http://music-gesture.csail.mit.edu.
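The abstract describes a two-stage pipeline: a graph network over tracked body/finger keypoints, conditioned on visual semantic context, followed by an audio-visual fusion model that ties the resulting motion features to the audio of the corresponding player. Below is a minimal PyTorch sketch of that idea for intuition only; the class names, dimensions, the learned-adjacency message passing, and the simple mask-based fusion are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn


class KeypointGCN(nn.Module):
    """Illustrative graph network over body/hand keypoints.

    Assumption: the paper's "context-aware graph network" is approximated
    here by one round of message passing over a learned skeleton adjacency,
    with a projected semantic-context vector added to every joint feature.
    """

    def __init__(self, num_joints=25, in_dim=2, hid_dim=64, ctx_dim=64):
        super().__init__()
        # Learnable adjacency over the skeleton graph (row-normalized in forward).
        self.adj = nn.Parameter(
            torch.eye(num_joints) + 0.01 * torch.randn(num_joints, num_joints)
        )
        self.fc1 = nn.Linear(in_dim, hid_dim)
        self.fc2 = nn.Linear(hid_dim, hid_dim)
        self.ctx_proj = nn.Linear(ctx_dim, hid_dim)

    def forward(self, kp, ctx):
        # kp:  (B, T, J, 2) 2-D keypoint coordinates per frame
        # ctx: (B, ctx_dim) pooled visual semantic context
        h = torch.relu(self.fc1(kp))                      # per-joint features
        h = h + self.ctx_proj(ctx)[:, None, None, :]      # inject semantic context
        a = torch.softmax(self.adj, dim=-1)               # normalized adjacency
        h = torch.relu(self.fc2(torch.einsum("ij,btjd->btid", a, h)))
        return h.mean(dim=2)                              # (B, T, hid_dim) motion feature


class AudioVisualFusion(nn.Module):
    """Predict a spectrogram mask for one player from mixture audio + motion."""

    def __init__(self, n_freq=256, hid_dim=64):
        super().__init__()
        self.audio_enc = nn.Linear(n_freq, hid_dim)
        self.mask_head = nn.Linear(2 * hid_dim, n_freq)

    def forward(self, mix_spec, motion):
        # mix_spec: (B, T, n_freq) mixture magnitude spectrogram
        # motion:   (B, T, hid_dim) from KeypointGCN (assumed time-aligned)
        a = torch.relu(self.audio_enc(mix_spec))
        mask = torch.sigmoid(self.mask_head(torch.cat([a, motion], dim=-1)))
        return mask * mix_spec                            # separated spectrogram


if __name__ == "__main__":
    gcn, fusion = KeypointGCN(), AudioVisualFusion()
    kp = torch.randn(2, 100, 25, 2)    # 2 clips, 100 frames, 25 joints
    ctx = torch.randn(2, 64)           # pooled visual features (hypothetical)
    mix = torch.rand(2, 100, 256)      # mixture spectrogram frames
    sep = fusion(mix, gcn(kp, ctx))
    print(sep.shape)                   # torch.Size([2, 100, 256])
```

Masking the mixture spectrogram is the standard formulation in this line of visual sound separation work; the key point the sketch illustrates is that the separation mask is conditioned on structured motion features rather than on raw appearance or optical flow.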