Paper Title

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Authors

Tianhao Li, Limin Wang

Abstract

Current video representations heavily rely on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information, such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. Specifically, we adopt noise-contrastive estimation to tackle the computational issue imposed by the huge number of pair-instance classes, and design a practical curriculum learning strategy. We train our CPD models on both a standard video dataset (Kinetics-210K) and an uncurated web video dataset (Instagram-300K) to demonstrate their effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization to fine-tune on downstream tasks, which yields a remarkable performance gain for action recognition on UCF101 and HMDB51, compared with existing state-of-the-art self-supervised training methods. In addition, our CPD model yields a new state of the art for zero-shot action recognition on UCF101 by directly utilizing the learnt visual-textual embeddings. The code will be made available at https://github.com/MCG-NJU/CPD-Video.
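The core mechanism described in the abstract is discriminating matched video-text pairs from mismatched ones. The sketch below is a rough illustration of that idea, not the paper's implementation: it scores a batch of video and text embeddings with an InfoNCE-style contrastive loss, using in-batch negatives as a simple stand-in for the paper's noise-contrastive estimation over pair-instance classes (the curriculum learning strategy is not modeled). The function name, embedding dimension, and temperature value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cpd_style_loss(video_emb, text_emb, temperature=0.07):
    """Illustrative contrastive video-text pair discrimination.

    video_emb, text_emb: (N, D) outputs of a video encoder and a text
    encoder for N video-text pairs. Matching pairs share a row index;
    every other row in the batch serves as a negative, a common
    approximation to full noise-contrastive estimation over all
    pair-instance classes.
    """
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.T / temperature          # (N, N) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: pick the true text for each video,
    # and the true video for each text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random stand-ins for encoder outputs.
video_emb = torch.randn(8, 256)
text_emb = torch.randn(8, 256)
loss = cpd_style_loss(video_emb, text_emb)
```

Under this formulation, the learnt embeddings can be reused directly: freezing the video encoder and training a linear classifier on top corresponds to the linear classification protocol mentioned above, and comparing video embeddings against embeddings of class-name text is one way to realize the zero-shot setting.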
