Paper Title

Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos

Paper Authors

Juncheng Li, Junlin Xie, Linchao Zhu, Long Qian, Siliang Tang, Wenqiao Zhang, Haochen Shi, Shengyu Zhang, Longhui Wei, Qi Tian, Yueting Zhuang

Paper Abstract

Understanding human emotions is a crucial ability for intelligent robots to provide better human-robot interactions. The existing works are limited to trimmed video-level emotion classification and fail to locate the temporal window corresponding to the emotion. In this paper, we introduce a new task, named Temporal Emotion Localization in videos (TEL), which aims to detect human emotions and localize their corresponding temporal boundaries in untrimmed videos with aligned subtitles. TEL presents three unique challenges compared to temporal action localization: 1) the emotions have extremely varied temporal dynamics; 2) the emotion cues are embedded in both appearances and complex plots; 3) fine-grained temporal annotation is complicated and labor-intensive. To address the first two challenges, we propose a novel dilated context integrated network with a coarse-fine two-stream architecture. The coarse stream captures varied temporal dynamics by modeling multi-granularity temporal contexts. The fine stream achieves understanding of complex plots by reasoning about the dependencies among the multi-granularity temporal contexts from the coarse stream and adaptively integrating them into fine-grained video segment features. To address the third challenge, we introduce a cross-modal consensus learning paradigm, which leverages the inherent semantic consensus between the aligned video and subtitles to achieve weakly-supervised learning. We contribute a new testing set with 3,000 manually annotated temporal boundaries so that future research on the TEL problem can be quantitatively evaluated. Extensive experiments show the effectiveness of our approach on temporal emotion localization. The repository of this work is at https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
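The abstract describes the coarse-fine two-stream design only at a high level. As a rough illustration of one way such a design could be realized, the sketch below models multi-granularity temporal context with dilated 1-D convolutions (coarse stream) and lets fine-grained segment features attend over those contexts (fine stream). The class name DilatedContextSketch, the dilation rates, the feature dimension, and the attention-based fusion are all illustrative assumptions, not the authors' implementation; the actual model is in the repository linked above.

# Minimal sketch (not the authors' code): a coarse stream of dilated 1-D
# convolutions capturing multi-granularity temporal context, and a fine
# stream that attends over those contexts to refine per-segment features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedContextSketch(nn.Module):
    def __init__(self, dim=512, dilations=(1, 2, 4, 8)):
        super().__init__()
        # Coarse stream: one dilated temporal convolution per granularity.
        # padding=d with kernel_size=3 keeps the sequence length unchanged.
        self.coarse_convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        # Fine stream: segment features attend over the multi-granularity
        # contexts and integrate them adaptively.
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segments):
        # segments: (batch, num_segments, dim) fine-grained video features
        x = segments.transpose(1, 2)  # (B, dim, T) for Conv1d
        # Multi-granularity temporal contexts from the coarse stream.
        contexts = [F.relu(conv(x)).transpose(1, 2) for conv in self.coarse_convs]
        context = torch.cat(contexts, dim=1)  # (B, T * len(dilations), dim)
        # Fine stream: queries are segment features; keys/values are contexts.
        integrated, _ = self.attn(segments, context, context)
        return self.norm(segments + integrated)  # refined segment features

# Usage: refine 128 segment features of dimension 512.
feats = torch.randn(2, 128, 512)
refined = DilatedContextSketch()(feats)  # -> torch.Size([2, 128, 512])

Larger dilation rates widen the temporal receptive field without adding parameters per layer, which is one plausible way to cover emotions whose durations vary widely, while the attention step lets each segment pick the granularity that matters for it.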
