Paper Title
OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement
Paper Authors
Paper Abstract
Traditional video captioning aims at a holistic description of the video, yet detailed descriptions of specific objects may not be available. Without associating moving trajectories, such image-based, data-driven methods cannot understand activities from the spatio-temporal transitions of inter-object visual features. Moreover, training on ambiguous clip-sentence pairs hinders learning the multi-modal functional mappings owing to their one-to-many nature. In this paper, we propose a novel task, named object-oriented video captioning, to understand videos at the object level. We introduce the video-based object-oriented video captioning network (OVC-Net), built on a temporal graph and detail enhancement, to effectively analyze activities over time and stably capture vision-language connections under small-sample conditions. The temporal graph provides a useful supplement to previous image-based approaches, allowing the model to reason about activities from the temporal evolution of visual features and the dynamic movement of spatial locations. The detail enhancement helps capture the discriminative features among different objects, with which the subsequent captioning module can yield more informative and precise descriptions. Thereafter, we construct a new dataset providing consistent object-sentence pairs to facilitate effective cross-modal learning. To demonstrate the effectiveness, we conduct experiments on the new dataset and compare OVC-Net with state-of-the-art video captioning methods. The experimental results show that OVC-Net precisely describes concurrent objects and achieves state-of-the-art performance.
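
The abstract names two components: a temporal graph that reasons over each tracked object's visual features and spatial locations across frames, and a detail-enhancement step that contrasts concurrent objects so the captioning module receives discriminative features. The abstract does not specify an implementation, so the PyTorch sketch below is only a minimal illustration of that pipeline under our own assumptions; all module names, dimensions, and the GRU-based temporal aggregation are hypothetical choices, not the authors' architecture.

# Minimal, illustrative sketch (not the authors' code) of the two ideas the
# abstract names: temporal reasoning over an object's per-frame appearance
# features and box locations, and detail enhancement that contrasts an
# object's features against the other objects in the same clip.
import torch
import torch.nn as nn

class TemporalGraph(nn.Module):
    """Links an object's features across frames along its trajectory."""
    def __init__(self, feat_dim=512, loc_dim=4, hidden=512):
        super().__init__()
        # Joint embedding of appearance (feat_dim) and box location (loc_dim).
        self.embed = nn.Linear(feat_dim + loc_dim, hidden)
        # A GRU stands in for message passing along the temporal chain (assumption).
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, feats, locs):
        # feats: (B, T, feat_dim) per-frame object features
        # locs:  (B, T, 4) normalized boxes (x, y, w, h) along the trajectory
        x = torch.relu(self.embed(torch.cat([feats, locs], dim=-1)))
        out, _ = self.rnn(x)          # temporal evolution of the object
        return out.mean(dim=1)        # (B, hidden) clip-level object state

class DetailEnhancement(nn.Module):
    """Emphasizes what distinguishes one object from the others in a clip."""
    def __init__(self, hidden=512):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, obj_states):
        # obj_states: (N, hidden) for the N concurrent objects in a clip
        mean = obj_states.mean(dim=0, keepdim=True)
        # Pair each object's state with its deviation from the clip mean.
        return torch.relu(self.proj(
            torch.cat([obj_states, obj_states - mean], dim=-1)))

# Usage: 3 concurrent objects tracked over 16 frames each.
graph, enhance = TemporalGraph(), DetailEnhancement()
feats, locs = torch.randn(3, 16, 512), torch.rand(3, 16, 4)
obj_repr = enhance(graph(feats, locs))   # (3, 512)
print(obj_repr.shape)

In such a setup, each per-object representation would condition a sequence decoder that emits one sentence per object, matching the consistent one-to-one object-sentence pairs the abstract says the new dataset provides.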