Paper Title

LGDN: Language-Guided Denoising Network for Video-Language Modeling

Paper Authors

Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu

Paper Abstract

Video-language modeling has attracted much attention with the rapid growth of web videos. Most existing methods assume that the video frames and the text description are semantically correlated, and focus on video-language modeling at the video level. However, this hypothesis often fails for two reasons: (1) with the rich semantics of video content, it is difficult to cover all frames with a single video-level description; (2) a raw video typically contains noisy/meaningless information (e.g., scenery shots, transitions, or teasers). Although a number of recent works deploy attention mechanisms to alleviate this problem, the irrelevant/noisy information still makes it very difficult to address. To overcome such challenges, we propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN), for video-language modeling. Different from most existing methods that utilize all extracted video frames, LGDN dynamically filters out misaligned or redundant frames under language supervision and obtains only 2--4 salient frames per video for cross-modal token-level alignment. Extensive experiments on five public datasets show that our LGDN outperforms the state of the art by large margins. We also provide a detailed ablation study to reveal the critical importance of solving the noise issue, in the hope of inspiring future video-language work.
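To make the denoising idea concrete, below is a minimal Python/PyTorch sketch of language-guided frame filtering. It is an illustration under assumptions, not the authors' implementation: it assumes per-frame embeddings from a visual encoder and a sentence embedding from a text encoder, scores each frame by cosine similarity to the text, and keeps only the top-k (2--4) salient frames. The function name `select_salient_frames` and the cosine-similarity scoring choice are hypothetical.

```python
# Minimal sketch of language-guided frame denoising (illustrative only,
# not the paper's actual module). Assumes frame embeddings from a visual
# encoder and a sentence embedding from a text encoder; the helper name
# and the choice of k are hypothetical.
import torch
import torch.nn.functional as F

def select_salient_frames(frame_embs: torch.Tensor,
                          text_emb: torch.Tensor,
                          k: int = 4) -> torch.Tensor:
    """Keep only the k frames most relevant to the paired text.

    frame_embs: (num_frames, dim) per-frame features.
    text_emb:   (dim,) sentence-level feature of the description.
    Returns:    (k, dim) embeddings of the selected salient frames.
    """
    # Cosine similarity between each frame and the sentence embedding
    # acts as a language-guided relevance score for each frame.
    scores = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1)
    # Retain the top-k salient frames; misaligned/noisy frames are dropped.
    topk = scores.topk(k=min(k, frame_embs.size(0))).indices
    return frame_embs[topk]

# Usage: 16 sampled frames reduced to 4 salient ones.
frames = torch.randn(16, 256)
text = torch.randn(256)
salient = select_salient_frames(frames, text, k=4)
print(salient.shape)  # torch.Size([4, 256])
```

In the full model, the retained salient frames would then feed into the cross-modal token-level alignment described in the abstract; the sketch only shows the filtering step.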
