Paper Title
Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
Paper Authors
Paper Abstract
This paper addresses the task of text-to-video retrieval: given a query in the form of a natural-language sentence, the goal is to retrieve the videos semantically relevant to that query from a large collection of unlabeled videos. Success on this task depends on cross-modal representation learning that projects both videos and sentences into a common space for semantic similarity computation. In this work, we concentrate on video representation learning, an essential component of text-to-video retrieval. Inspired by human reading strategies, we propose Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos. RIVRL consists of two branches: a previewing branch and an intensive-reading branch. The previewing branch briefly captures an overview of the video, while the intensive-reading branch extracts more in-depth information. Moreover, the intensive-reading branch is made aware of the video overview captured by the previewing branch; this holistic information proves useful for the intensive-reading branch when extracting fine-grained features. Extensive experiments on three datasets show that RIVRL achieves a new state of the art on TGIF and VATEX. Moreover, on MSR-VTT, our model using only two video features performs comparably to a state-of-the-art model using seven video features, and even outperforms models pre-trained on the large-scale HowTo100M dataset.
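The abstract only outlines the two-branch design, so the following is a minimal PyTorch sketch of how a previewing/intensive-reading split with overview-aware fusion might look. All module names, dimensions, and the mean-pooling/GRU choices here are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RIVRLSketch(nn.Module):
    """Illustrative two-branch video encoder (not the paper's code).

    - previewing branch: mean-pools frame features into a coarse overview
    - intensive-reading branch: a GRU over frames, initialized with the
      overview so the fine-grained reading is aware of the holistic context
    """

    def __init__(self, frame_dim=2048, embed_dim=512):
        super().__init__()
        self.preview_fc = nn.Linear(frame_dim, embed_dim)          # overview projection
        self.gru = nn.GRU(frame_dim, embed_dim, batch_first=True)  # dense reading
        self.fuse = nn.Linear(embed_dim * 2, embed_dim)            # overview-aware fusion

    def forward(self, frames):                      # frames: (B, T, frame_dim)
        overview = self.preview_fc(frames.mean(dim=1))   # (B, embed_dim)
        # Condition the reader on the overview so it "knows" the gist first.
        h0 = overview.unsqueeze(0)                       # (1, B, embed_dim)
        states, _ = self.gru(frames, h0)                 # (B, T, embed_dim)
        intensive = states.mean(dim=1)                   # (B, embed_dim)
        fused = self.fuse(torch.cat([overview, intensive], dim=-1))
        return F.normalize(fused, dim=-1)           # unit vector for cosine similarity

# Usage: embed a batch of videos and score them against a sentence embedding
# (the random query below stands in for a text encoder's output).
video_emb = RIVRLSketch()(torch.randn(4, 30, 2048))      # 4 videos, 30 frames each
query_emb = F.normalize(torch.randn(1, 512), dim=-1)
scores = query_emb @ video_emb.t()                       # cosine similarities, (1, 4)
```

Normalizing both embeddings makes the dot product a cosine similarity in the shared space, matching the abstract's description of projecting videos and sentences into a common space for semantic similarity computation.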