Paper Title
OmniVL: One Foundation Model for Image-Language and Video-Language Tasks
Paper Authors
Paper Abstract
This paper presents OmniVL, a new foundation model designed to support both image-language and video-language tasks with one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language pretraining to help video-language tasks). To this end, we propose decoupled joint pretraining of image-language and video-language, which effectively decomposes vision-language modeling into spatial and temporal dimensions and yields performance gains on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss that leverages image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily-supervised pretraining data are exploited as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results at similar model size and data scale.
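To make the UniVLC idea concrete, below is a minimal PyTorch sketch of a unified contrastive loss in the style of UniCL, which UniVLC builds on: label data (image-label, video-label) is verbalized into text prompts so that all four data types share one image/video-text contrastive objective, and samples sharing the same label or caption are treated as mutual positives. This is not the authors' implementation; the function name univlc_loss, the targets convention, and the temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def univlc_loss(visual_emb, text_emb, targets, temperature=0.07):
    """Unified vision-language contrastive loss (sketch).

    visual_emb, text_emb: (N, D) features from the visual/text encoders.
    targets: (N,) integer ids; samples with the same id (same class label,
        or identical caption) are treated as mutual positives, so both
        supervised label data and noisy web captions fit one objective.
    """
    v = F.normalize(visual_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature  # (N, N) pairwise similarities

    # Positive mask: any pair that shares a target id (diagonal included).
    pos = (targets[:, None] == targets[None, :]).float()

    # Symmetric cross-entropy with multiple positives per anchor,
    # averaged over the positives of each row.
    log_p_v2t = F.log_softmax(logits, dim=1)
    log_p_t2v = F.log_softmax(logits.T, dim=1)
    loss_v2t = -(pos * log_p_v2t).sum(1) / pos.sum(1)
    loss_t2v = -(pos * log_p_t2v).sum(1) / pos.sum(1)
    return 0.5 * (loss_v2t.mean() + loss_t2v.mean())

With distinct ids for every caption this reduces to the standard CLIP-style InfoNCE loss, while shared ids for classification/action-recognition labels let supervised data contribute extra positive pairs within the same batch.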