Paper Title

Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Paper Authors

Sinan Tan, Mengmeng Ge, Di Guo, Huaping Liu, Fuchun Sun

Paper Abstract

In the Vision-and-Language Navigation task, an embodied agent follows language instructions and navigates to a specific goal. The task is important in many practical scenarios and has attracted extensive attention from both the computer vision and robotics communities. However, most existing works use only RGB images and neglect the 3D semantic information of the scene. To this end, we develop a novel self-supervised training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation. Specifically, a region query task is designed as the pretext task, which predicts the presence or absence of objects of a particular class in a specific 3D region. Then, we construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs. Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset, respectively, outperforming most RGB-based methods that utilize vision-language transformers.
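
To make the region query pretext task concrete, below is a minimal illustrative sketch in PyTorch: a 3D encoder reads a voxel-level semantic reconstruction, and a query head predicts whether any voxel of a queried class lies inside a queried 3D box, with the label computed for free from the reconstruction itself. The class count NUM_CLASSES, grid size GRID, the particular 3D CNN encoder, the normalized-box region encoding, and the RegionQueryModel / region_query_label names are all assumptions made for illustration, not the authors' implementation.

# Sketch (not the authors' code) of a "region query" pretext task: given a
# voxel-level semantic reconstruction, predict whether any voxel of a queried
# class appears inside a queried 3D region. Sizes and encodings are assumed.
import torch
import torch.nn as nn

NUM_CLASSES = 40          # assumed number of semantic classes
GRID = 32                 # assumed voxel grid resolution (GRID^3 voxels)

class RegionQueryModel(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # 3D CNN encoder: one-hot semantic voxels -> global feature vector
        self.encoder = nn.Sequential(
            nn.Conv3d(NUM_CLASSES, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        # Query head: feature + region box (6 coords) + class one-hot -> presence logit
        self.head = nn.Sequential(
            nn.Linear(feat_dim + 6 + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, voxels, region_box, class_onehot):
        feat = self.encoder(voxels)                      # (B, feat_dim)
        query = torch.cat([feat, region_box, class_onehot], dim=-1)
        return self.head(query).squeeze(-1)              # (B,) presence logits

def region_query_label(voxels, region_box, class_idx):
    """Self-supervised label: does class `class_idx` occur inside `region_box`?
    `region_box` holds normalized (x0, y0, z0, x1, y1, z1) corners."""
    lo = (region_box[:, :3] * GRID).long().clamp(0, GRID - 1)
    hi = (region_box[:, 3:] * GRID).long().clamp(1, GRID)
    labels = []
    for b in range(voxels.shape[0]):
        sub = voxels[b, class_idx[b],
                     lo[b, 0]:hi[b, 0], lo[b, 1]:hi[b, 1], lo[b, 2]:hi[b, 2]]
        labels.append((sub.sum() > 0).float())
    return torch.stack(labels)

# Tiny usage example with random data standing in for a real reconstruction.
if __name__ == "__main__":
    model = RegionQueryModel()
    voxels = (torch.rand(4, NUM_CLASSES, GRID, GRID, GRID) > 0.99).float()
    box = torch.sort(torch.rand(4, 2, 3), dim=1).values.reshape(4, 6)  # x0<=x1, etc.
    cls = torch.randint(0, NUM_CLASSES, (4,))
    cls_onehot = nn.functional.one_hot(cls, NUM_CLASSES).float()
    logits = model(voxels, box, cls_onehot)
    labels = region_query_label(voxels, box, cls)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    loss.backward()
    print(loss.item())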
