Paper Title
Learning State-Aware Visual Representations from Audible Interactions
Paper Authors
Paper Abstract
We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from these videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on the moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment. However, current successful multi-modal learning frameworks encourage representation invariance over time. To address these challenges, we leverage audio signals to identify moments of likely interactions which are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.
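To make the two ideas in the abstract concrete, below is a minimal sketch, not the authors' exact formulation: a simple audio-based heuristic for locating likely interaction moments, and an InfoNCE-style contrastive loss that associates the audio of an interaction with the visual state change it causes. The function names (`detect_interaction_moments`, `state_change_loss`), the encoders producing the embeddings, and the specific loss form are assumptions made for illustration.

```python
# Hedged sketch of (1) audio-driven moment-of-interaction selection and
# (2) a state-change-aware contrastive objective. Hyperparameters, encoders,
# and loss details are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F


def detect_interaction_moments(spectrogram, top_k=4):
    """
    Pick timestamps where the audio changes most abruptly, as a rough proxy
    for moments of interaction (e.g. placing a pan, closing a drawer).
    spectrogram: (T, F) log-mel spectrogram of an untrimmed clip.
    Returns indices of the top_k largest frame-to-frame changes.
    """
    change = (spectrogram[1:] - spectrogram[:-1]).abs().mean(dim=-1)  # (T-1,)
    return torch.topk(change, k=min(top_k, change.numel())).indices


def info_nce(query, keys, temperature=0.07):
    """Standard InfoNCE: the i-th query should match the i-th key in the batch."""
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)


def state_change_loss(v_before, v_after, a_change, temperature=0.07):
    """
    Align the audio of an interaction with the visual state change it induces.
    v_before, v_after: visual embeddings just before/after the audible moment, (B, D)
    a_change: audio embeddings around the audible moment, (B, D)
    """
    delta = v_after - v_before                       # visual state-change vector
    # Symmetric contrastive loss between the audio and the state-change direction,
    # so the representation must be sensitive to how the environment changed.
    return 0.5 * (info_nce(delta, a_change, temperature)
                  + info_nce(a_change, delta, temperature))
```

Under these assumptions, training would sample clips centered on the detected moments and combine `state_change_loss` with a conventional audio-visual correspondence loss, so that representations remain discriminative across time rather than collapsing to time-invariant features.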