Paper Title
Retrieving and Highlighting Action with Spatiotemporal Reference
Paper Authors
Paper Abstract
In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods. Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video. Compared to conventional action recognition tasks, which focus on classification or window-based localization, action highlighting is a fine-grained task. Leveraging weak supervision from annotated captions, our framework acquires spatiotemporal relevance maps and generates local embeddings that relate to the nouns and verbs in the captions. Through experiments, we show that our model generates distinct maps conditioned on different actions, whereas conventional visual reasoning methods produce only a single deterministic saliency map. Moreover, our model improves retrieval recall over the baseline without alignment by 2-3% on the MSR-VTT dataset.
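As a rough illustration of the relevance-map idea described above, the following PyTorch sketch scores every spatiotemporal location of a video feature map against a single word embedding in a shared cross-modal space. The tensor shapes, function names, and the cosine-similarity scoring are assumptions made for illustration only, not the paper's actual formulation.

```python
# Minimal sketch, assuming video and word features live in a shared
# D-dimensional embedding space; shapes and names are hypothetical.
import torch
import torch.nn.functional as F

def relevance_map(video_feats: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between one word embedding and every
    spatiotemporal location of a video feature map.

    video_feats: (T, H, W, D) -- per-location local video embeddings
    word_emb:    (D,)         -- embedding of a noun or verb from the caption
    returns:     (T, H, W)    -- relevance in [-1, 1], one map per word
    """
    v = F.normalize(video_feats, dim=-1)       # unit-norm local embeddings
    w = F.normalize(word_emb, dim=-1)          # unit-norm word embedding
    return torch.einsum("thwd,d->thw", v, w)   # dot product == cosine similarity

# Different caption words yield different maps over the same clip,
# unlike a single deterministic saliency map.
T, H, W, D = 16, 7, 7, 512
feats = torch.randn(T, H, W, D)                # stand-in for backbone features
map_verb = relevance_map(feats, torch.randn(D))  # e.g. the verb "run"
map_noun = relevance_map(feats, torch.randn(D))  # e.g. the noun "ball"
```

Under this reading, conditioning the map on the word embedding is what lets one clip produce a different highlight per action, which is the behavior the abstract contrasts with single-map visual reasoning methods.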