Cross modal video representations for weakly supervised active speaker localization

Paper title

Cross modal video representations for weakly supervised active speaker localization

Authors

Sharma, Rahul, Somandepalli, Krishna, Narayanan, Shrikanth

Abstract

An objective understanding of media depictions, such as how inclusively people are portrayed in terms of how much they are heard and seen on screen in film and television, requires machines to discern automatically who is talking, when, how, and where, as well as when they are not. Speaker activity can be discerned automatically from the rich multimodal information present in media content. This is, however, a challenging problem because of the vast variety and contextual variability in media content and the lack of labeled data. In this work, we present a cross-modal neural network for learning visual representations that carry implicit information about the spatial location of a speaker in the visual frames. To avoid the need for manual annotations of active speakers in visual frames, which are very expensive to acquire, we present a weakly supervised system for the task of localizing active speakers in movie content. We use the learned cross-modal visual representations and provide weak supervision from movie subtitles, which act as a proxy for voice activity and thus require no manual annotations. We evaluate the performance of the proposed system on the AVA active speaker dataset and demonstrate the effectiveness of the cross-modal embeddings for localizing active speakers in comparison to fully supervised systems. We also demonstrate state-of-the-art performance for the task of voice activity detection in an audio-visual framework, especially when speech is accompanied by noise and music.
