Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation

Paper title

Take the Scenic Route: Improving Generalization in Vision-and-Language Navigation

Authors

Felix Yu, Zhiwei Deng, Karthik Narasimhan, Olga Russakovsky

Abstract

In the Vision-and-Language Navigation (VLN) task, an agent with egocentric vision navigates to a destination given natural language instructions. Manually annotating these instructions is time-consuming and expensive, so many existing approaches automatically generate additional samples to improve agent performance. However, these approaches still have difficulty generalizing to new environments. In this work, we investigate the popular Room-to-Room (R2R) VLN benchmark and discover that what is important is not only the amount of data you synthesize, but also how you do it. We find that shortest path sampling, which is used by both the R2R benchmark and existing augmentation methods, encodes biases in the action space of the agent, which we dub action priors. We then show that these action priors offer one explanation for the poor generalization of existing works. To mitigate such priors, we propose a path sampling method based on random walks to augment the data. By training with this augmentation strategy, our agent is able to generalize better to unknown environments compared to the baseline, significantly improving model performance in the process.
