Paper Title
MovieCLIP: Visual Scene Recognition in Movies
Paper Authors
Paper Abstract
Long-form media such as movies have complex narrative structures, with events spanning a rich variety of ambient visual scenes. Domain-specific challenges associated with visual scenes in movies include transitions, person coverage, and a wide array of real-life and fictional scenarios. Existing visual scene datasets for movies have limited taxonomies and do not consider visual scene transitions within movie clips. In this work, we address the problem of visual scene recognition in movies by first automatically curating a new and extensive movie-centric taxonomy of 179 scene labels derived from movie scripts and auxiliary web-based video datasets. Instead of manual annotations, which can be expensive, we use CLIP to weakly label 1.12 million shots from 32K movie clips based on our proposed taxonomy. We provide baseline visual models trained on this weakly labeled dataset, called MovieCLIP, and evaluate them on an independent dataset verified by human raters. We show that leveraging features from models pretrained on MovieCLIP benefits downstream tasks such as multi-label scene and genre classification of web videos and movie trailers.
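
As a rough illustration of the weak-labeling step described in the abstract (not the authors' released pipeline), a shot keyframe can be scored against the scene taxonomy with an off-the-shelf CLIP model. The label subset, prompt template, checkpoint, and function name below are illustrative assumptions; the paper's taxonomy has 179 labels and labeling operates at the shot level.

    # Sketch of CLIP-based weak scene labeling, assuming the Hugging Face
    # CLIP implementation; the taxonomy subset and prompts are hypothetical.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Toy subset of the 179-label scene taxonomy (illustrative only).
    SCENE_LABELS = ["office", "forest", "kitchen", "spaceship", "beach"]

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def weak_scene_labels(frame, top_k=3):
        # Compare the keyframe against one text prompt per scene label.
        prompts = [f"a photo of a {label}" for label in SCENE_LABELS]
        inputs = processor(text=prompts, images=frame,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image  # shape: (1, num_labels)
        probs = logits.softmax(dim=-1).squeeze(0)
        scores, idx = probs.topk(min(top_k, len(SCENE_LABELS)))
        return [(SCENE_LABELS[i], float(s)) for i, s in zip(idx, scores)]

    # Usage: weak labels for one keyframe extracted from a shot.
    # print(weak_scene_labels(Image.open("shot_keyframe.jpg")))

Keeping the top-k labels (or those above a confidence threshold) per shot then yields weak multi-label annotations of the kind used to train the baseline models.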