Paper Title

Finding Fallen Objects Via Asynchronous Audio-Visual Integration

Authors

Chuang Gan, Yi Gu, Siyuan Zhou, Jeremy Schwartz, Seth Alter, James Traer, Dan Gutfreund, Joshua B. Tenenbaum, Josh McDermott, Antonio Torralba

Abstract

The way an object looks and sounds provides complementary reflections of its physical properties. In many settings, cues from vision and audition arrive asynchronously but must be integrated, as when we hear an object dropped on the floor and then must find it. In this paper, we introduce a setting in which to study multi-modal object localization in 3D virtual environments. An object is dropped somewhere in a room. An embodied robot agent, equipped with a camera and microphone, must determine what object has been dropped -- and where -- by combining audio and visual signals with knowledge of the underlying physics. To study this problem, we have generated a large-scale dataset -- the Fallen Objects dataset -- that includes 8000 instances of 30 physical object categories in 64 rooms. The dataset uses the ThreeDWorld platform, which can simulate physics-based impact sounds and complex physical interactions between objects in a photorealistic setting. As a first step toward addressing this challenge, we develop a set of embodied agent baselines, based on imitation learning, reinforcement learning, and modular planning, and perform an in-depth analysis of the challenges of this new task.
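
The abstract describes an agent that must fuse an audio cue heard once (the impact sound) with visual observations gathered later during search. Below is a minimal, illustrative PyTorch sketch of that kind of asynchronous audio-visual fusion. It is not the authors' implementation: all module names, tensor shapes, and the discrete action space are assumptions made for this example.

```python
# Illustrative sketch (not the paper's code) of asynchronous audio-visual
# fusion: the impact sound is encoded once, at drop time, and its embedding
# is then combined with each later camera frame to drive a search policy.
import torch
import torch.nn as nn

class AudioVisualPolicy(nn.Module):
    def __init__(self, n_actions: int = 6):  # action count is an assumption
        super().__init__()
        # Audio branch: encodes a log-mel spectrogram of the impact sound.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 128),
        )
        # Visual branch: encodes the current egocentric RGB frame.
        self.visual_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )
        # Policy head over concatenated audio + visual features.
        self.policy = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, n_actions),
        )

    def forward(self, spectrogram, frame):
        a = self.audio_enc(spectrogram)  # heard once, at drop time
        v = self.visual_enc(frame)       # observed at every step
        return self.policy(torch.cat([a, v], dim=-1))  # action logits

# Usage: the audio input is fixed after the drop; only the frame changes.
policy = AudioVisualPolicy()
impact_audio = torch.randn(1, 1, 64, 128)  # dummy log-mel spectrogram
for _ in range(3):                         # a few steps of the search loop
    frame = torch.randn(1, 3, 128, 128)    # dummy egocentric RGB frame
    action = policy(impact_audio, frame).argmax(dim=-1)
```

Encoding the sound once and reusing its embedding at every step mirrors the asynchrony of the task: the impact is heard only at the moment the object falls, while egocentric vision remains available throughout the search.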
