Paper Title

Bi-Calibration Networks for Weakly-Supervised Video Representation Learning

Paper Authors

Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, Tao Mei

Paper Abstract

Leveraging large volumes of web videos paired with search queries or surrounding texts (e.g., titles) offers an economical and scalable alternative to supervised video representation learning. Nevertheless, modeling such weak visual-textual connections is not trivial due to query polysemy (i.e., many possible meanings for a query) and text isomorphism (i.e., the same syntactic structure across different texts). In this paper, we introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning. Specifically, we present Bi-Calibration Networks (BCN), which novelly couple two calibrations to learn the amendment from text to query and vice versa. Technically, BCN performs clustering on all the titles of the videos retrieved by an identical query and takes the centroid of each cluster as a text prototype. The query vocabulary is built directly on the query words. The video-to-text/video-to-query projections over the text prototypes/query vocabulary then trigger the text-to-query or query-to-text calibration to estimate the amendment to the query or text. We also devise a selection scheme to balance the two corrections. Two large-scale web video datasets, each pairing a query and a title with every video, are newly collected for weakly-supervised video representation learning, named YOVO-3M and YOVO-10M, respectively. The video features of BCN learned on 3M web videos obtain superior results under the linear-model protocol on downstream tasks. More remarkably, BCN trained on the larger set of 10M web videos with further fine-tuning yields 1.6% and 1.8% gains in top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets over the state-of-the-art TDN and ACTION-Net methods with ImageNet pre-training. Source code and datasets are available at \url{https://github.com/FuchenUSTC/BCN}.
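To make the prototype-and-projection steps in the abstract concrete, the following is a minimal sketch, assuming pre-computed title embeddings; it is not the authors' released code, and the embedding dimension, cluster count, and function names are illustrative assumptions. It clusters the title embeddings of the videos retrieved by each query with k-means, keeps the centroids as text prototypes, and computes the softmax projection of a video embedding over those prototypes, i.e., the signal that starts the text-to-query calibration.

```python
# A minimal sketch (not the authors' code) of text-prototype construction
# and the video-to-text projection described in the abstract. Embedding
# dimension, cluster count, and all names are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_text_prototypes(title_embeddings_by_query, num_clusters=4):
    """For each query, k-means-cluster the title embeddings of its
    retrieved videos and keep the centroids as text prototypes."""
    prototypes = {}
    for query, embs in title_embeddings_by_query.items():
        embs = np.asarray(embs)
        k = min(num_clusters, len(embs))  # guard against sparse queries
        km = KMeans(n_clusters=k, n_init=10).fit(embs)
        prototypes[query] = km.cluster_centers_  # shape: (k, dim)
    return prototypes

def video_to_text_projection(video_emb, prototypes):
    """Softmax of the video embedding's similarities over all text
    prototypes -- the projection that starts text-to-query calibration."""
    protos = np.concatenate(list(prototypes.values()), axis=0)
    logits = protos @ video_emb
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# Toy usage: two queries with random 128-d "title embeddings".
rng = np.random.default_rng(0)
titles = {"surfing": rng.normal(size=(50, 128)),
          "playing guitar": rng.normal(size=(40, 128))}
protos = build_text_prototypes(titles)
dist = video_to_text_projection(rng.normal(size=128), protos)
print(dist.shape, round(dist.sum(), 6))  # (8,) 1.0
```

The video-to-query projection over the query vocabulary would follow the same pattern, with query-word embeddings in place of the clustered prototypes.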
