Character Matters: Video Story Understanding with Character-Aware Relations

Paper Title

Character Matters: Video Story Understanding with Character-Aware Relations

Authors

Shijie Geng, Ji Zhang, Zuohui Fu, Peng Gao, Hang Zhang, Gerard de Melo

Abstract

Unlike short videos and GIFs, video stories contain clear plots and lists of principal characters. Without identifying the connection between the people who appear on screen and the character names, a model cannot obtain a genuine understanding of the plot. Video Story Question Answering (VSQA) offers an effective way to benchmark a model's higher-level comprehension abilities. However, current VSQA methods merely extract generic visual features from a scene. With such an approach, they remain prone to learning only superficial correlations. To attain a genuine understanding of who did what to whom, we propose a novel model that continuously refines character-aware relations. This model specifically considers the characters in a video story, as well as the relations connecting different characters and objects. Based on these signals, our framework enables weakly-supervised face naming through multi-instance co-occurrence matching and supports high-level reasoning using Transformer structures. We train and test our model on the six diverse TV shows in the TVQA dataset, which is by far the largest and only publicly available dataset for VSQA. We validate our proposed approach on the TVQA dataset through extensive ablation studies.
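The abstract names weakly-supervised face naming through co-occurrence matching but does not spell out the procedure. As a rough intuition only, the toy sketch below assigns each face cluster the character name it co-occurs with most often across clips. It is not the authors' model: the function `name_faces`, the input layout, and the names `Alice` and `Bob` are all hypothetical, and raw counting stands in for whatever matching the paper actually uses.

```python
from collections import Counter, defaultdict


def name_faces(clips):
    """Toy co-occurrence matching (illustrative, not the paper's method).

    `clips` is a list of (face_ids, names) pairs: the face clusters seen in
    a clip and the character names mentioned in that clip's subtitles.
    Returns a dict mapping each face cluster to its most frequent co-name.
    """
    cooccurrence = defaultdict(Counter)  # face_id -> Counter of names
    for face_ids, names in clips:
        unique_names = set(names)
        for face in set(face_ids):
            cooccurrence[face].update(unique_names)

    assignment = {}
    for face, counts in cooccurrence.items():
        if not counts:
            continue  # face never appeared alongside any name
        # Iterate names alphabetically so ties are broken deterministically.
        assignment[face] = max(sorted(counts), key=lambda n: counts[n])
    return assignment


# Hypothetical clips: no clip says which face belongs to which name.
clips = [
    (["f1", "f2"], ["Alice", "Bob"]),
    (["f1"], ["Alice"]),
    (["f2"], ["Bob"]),
    (["f1", "f2"], ["Alice", "Bob"]),
]
result = name_faces(clips)
```

The supervision is weak in the sense that each clip only provides a bag of faces and a bag of names; the pairing emerges from clips where a face and a name appear without competitors.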
