Paper Title

VLG-Net: Video-Language Graph Matching Network for Video Grounding

Paper Authors

Mattia Soldan, Mengmeng Xu, Sisi Qu, Jesper Tegner, Bernard Ghanem

Paper Abstract

Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and the fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs built atop video snippets and query tokens separately and used to model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling by fusing the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
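The abstract states that moment candidates are formed with masked moment attention pooling over enriched snippet features. Below is a minimal, hypothetical PyTorch sketch of such pooling, assuming snippet features of shape (T, D) and integer [start, end) moment spans; the names (masked_moment_attention_pool, attn_proj, snippet_feats, moments) are illustrative and do not come from the paper's released code.

```python
# A minimal sketch (not the authors' implementation) of masked moment attention
# pooling: each candidate moment attends only to the snippets inside its span,
# and the attention-weighted sum of those snippet features becomes the moment
# representation.
import torch
import torch.nn.functional as F


def masked_moment_attention_pool(snippet_feats, moments, attn_proj):
    """
    snippet_feats: (T, D) enriched snippet features after cross-modal matching.
    moments:       (M, 2) integer [start, end) snippet indices per candidate.
    attn_proj:     torch.nn.Linear(D, 1) scoring layer for per-snippet logits.
    returns:       (M, D) pooled feature for each moment candidate.
    """
    T, _ = snippet_feats.shape
    logits = attn_proj(snippet_feats).squeeze(-1)        # (T,) attention logits
    idx = torch.arange(T, device=snippet_feats.device)   # snippet indices
    # Boolean mask: True where snippet t lies inside moment m's [start, end).
    mask = (idx.unsqueeze(0) >= moments[:, :1]) & (idx.unsqueeze(0) < moments[:, 1:])
    masked_logits = logits.unsqueeze(0).masked_fill(~mask, float("-inf"))
    weights = F.softmax(masked_logits, dim=-1)           # (M, T), zero outside span
    return weights @ snippet_feats                       # (M, D)


# Example usage with random features and two hypothetical candidate moments:
# feats = torch.randn(32, 256)
# proj = torch.nn.Linear(256, 1)
# pooled = masked_moment_attention_pool(feats, torch.tensor([[0, 8], [10, 20]]), proj)
```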
