Paper Title

A Local-to-Global Approach to Multi-modal Movie Scene Segmentation

Paper Authors

Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin

Abstract

Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of movies. This is very challenging compared to the videos studied in conventional vision problems, e.g. action recognition, since scenes in movies usually contain much richer temporal structures and more complex semantic information. Towards this goal, we scale up the scene segmentation task by building a large-scale video dataset, MovieScenes, which contains 21K annotated scene segments from 150 movies. We further propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie. This framework is able to distill complex semantics from hierarchical temporal structures over a long movie, providing top-down guidance for scene segmentation. Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods. We also find that pretraining on MovieScenes brings significant improvements to existing approaches.
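The abstract describes integrating features at three levels: clip, segment, and movie. A minimal toy sketch of that local-to-global aggregation idea (not the paper's actual architecture; all sizes, pooling choices, and the fusion scheme here are illustrative assumptions) might look like:

```python
import numpy as np

# Hypothetical illustration of three-level local-to-global feature fusion.
# None of these names or sizes come from the paper; they are toy assumptions.
rng = np.random.default_rng(0)

n_clips, feat_dim, seg_len = 12, 8, 4  # assumed toy dimensions

# Level 1 (clip): one multi-modal feature vector per clip.
clip_feats = rng.normal(size=(n_clips, feat_dim))

# Level 2 (segment): mean-pool consecutive clips into coarser segments.
seg_feats = clip_feats.reshape(n_clips // seg_len, seg_len, feat_dim).mean(axis=1)

# Level 3 (movie): a single global feature summarizing the whole movie.
movie_feat = clip_feats.mean(axis=0)

# Fuse all three levels per clip: the clip's own feature, its segment's
# feature, and the movie-level context, concatenated along the feature axis.
fused = np.concatenate(
    [
        clip_feats,                               # local
        np.repeat(seg_feats, seg_len, axis=0),    # segment context
        np.tile(movie_feat, (n_clips, 1)),        # global context
    ],
    axis=1,
)

print(fused.shape)  # (12, 24): 12 clips x (3 levels * 8 dims)
```

A boundary classifier could then score each fused vector to decide whether a scene boundary follows that clip; the actual model in the paper operates on richer multi-modal inputs with learned, hierarchical components.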
