Paper Title
Boundary-aware Self-supervised Learning for Video Scene Segmentation
Paper Authors
Paper Abstract
Self-supervised learning has drawn attention for its effectiveness in learning in-domain representations without ground-truth annotations; in particular, properly designed pretext tasks (e.g., a contrastive prediction task) have been shown to bring significant performance gains on downstream tasks (e.g., classification). Inspired by this, we tackle video scene segmentation, the task of temporally localizing scene boundaries in a video, with a self-supervised learning framework in which we mainly focus on designing effective pretext tasks. In our framework, we discover a pseudo-boundary from a sequence of shots by splitting it into two contiguous, non-overlapping sub-sequences, and we leverage the pseudo-boundary to facilitate pre-training. Based on this, we introduce three novel boundary-aware pretext tasks: 1) Shot-Scene Matching (SSM), 2) Contextual Group Matching (CGM), and 3) Pseudo-boundary Prediction (PP); SSM and CGM guide the model to maximize intra-scene similarity and inter-scene discrimination, while PP encourages the model to identify transitional moments. Through comprehensive analysis, we empirically show that pre-training and transferring contextual representations are both critical to improving video scene segmentation performance. Finally, we achieve a new state-of-the-art on the MovieNet-SSeg benchmark. The code is available at https://github.com/kakaobrain/bassl.
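The pseudo-boundary construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the random split policy, and the binary label scheme are assumptions, chosen only to show how splitting a shot window into two contiguous, non-overlapping sub-sequences yields a pseudo-boundary label usable by a task like PP.

```python
import random

def make_pseudo_boundary(shot_ids):
    """Split a window of consecutive shots into two contiguous,
    non-overlapping sub-sequences; the split point serves as a
    pseudo-boundary (illustrative sketch, not the paper's code)."""
    assert len(shot_ids) >= 2, "need at least two shots to form a boundary"
    # Sample a split point so both sub-sequences are non-empty.
    split = random.randint(1, len(shot_ids) - 1)
    left, right = shot_ids[:split], shot_ids[split:]
    # Label the last shot of the left sub-sequence as the pseudo-boundary;
    # all other shots receive a non-boundary label.
    labels = [0] * len(shot_ids)
    labels[split - 1] = 1
    return left, right, labels

# Example: an 8-shot window is split into two sub-sequences whose
# concatenation reproduces the original order.
left, right, labels = make_pseudo_boundary(list(range(8)))
```

A model pre-trained with such labels sees many artificial "transitions", which is what lets the PP task teach boundary awareness without any ground-truth scene annotations.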