Paper Title

Exploring Contextual Representation and Multi-Modality for End-to-End Autonomous Driving

Paper Authors

Shoaib Azam, Farzeen Munir, Ville Kyrki, Moongu Jeon, Witold Pedrycz

Paper Abstract

Learning contextual and spatial environmental representations enhances an autonomous vehicle's hazard anticipation and decision-making in complex scenarios. Recent perception systems enhance spatial understanding with sensor fusion but often lack full environmental context. Humans, when driving, naturally employ neural maps that integrate various factors such as historical data, situational subtleties, and behavioral predictions of other road users to form a rich contextual understanding of their surroundings. This neural map-based comprehension is integral to making informed decisions on the road. In contrast, even with their significant advancements, autonomous systems have yet to fully harness this depth of human-like contextual understanding. Motivated by this, our work draws inspiration from human driving patterns and seeks to formalize the sensor fusion approach within an end-to-end autonomous driving framework. We introduce a framework that integrates three cameras (left, right, and center) to emulate the human field of view, coupled with top-down bird's-eye-view semantic data to enhance contextual representation. The sensor data is fused and encoded using a self-attention mechanism, leading to an auto-regressive waypoint prediction module. We treat feature representation as a sequential problem, employing a vision transformer to distill the contextual interplay between sensor modalities. The efficacy of the proposed method is experimentally evaluated in both open- and closed-loop settings. Our method achieves a displacement error of 0.67 m in open-loop settings, surpassing current methods by 6.9% on the nuScenes dataset. In closed-loop evaluations on CARLA's Town05 Long and Longest6 benchmarks, the proposed method improves driving performance and route completion while reducing infractions.
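The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of that pipeline, not the authors' implementation: the module names, dimensions, the single pooled token per modality, and the GRU-based decoder (the paper says only "auto-regressive waypoint prediction") are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionWaypointModel(nn.Module):
    """Illustrative sketch: self-attention fusion of three camera views plus a
    BEV semantic map, followed by auto-regressive waypoint decoding."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4, n_waypoints=4):
        super().__init__()
        # Shared conv stem that pools each input to a single d_model token.
        # (Assumption: the paper likely keeps spatial token grids per view;
        # one pooled token per modality keeps this sketch short.)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Learned embedding marking which sensor each token came from:
        # 0=left, 1=center, 2=right, 3=BEV semantic map.
        self.modality_embed = nn.Embedding(4, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # self-attention fusion
        # Auto-regressive head: each step conditions on the previous waypoint.
        self.gru = nn.GRUCell(input_size=2, hidden_size=d_model)
        self.to_offset = nn.Linear(d_model, 2)
        self.n_waypoints = n_waypoints

    def forward(self, left, center, right, bev):
        # Encode each modality to one token -> (B, 4, d_model).
        tokens = torch.stack(
            [self.backbone(x) for x in (left, center, right, bev)], dim=1)
        tokens = tokens + self.modality_embed.weight.unsqueeze(0)
        fused = self.encoder(tokens).mean(dim=1)  # pooled context, (B, d_model)

        # Decode waypoints one at a time, feeding the last prediction back in.
        wp = torch.zeros(left.size(0), 2, device=left.device)
        h, waypoints = fused, []
        for _ in range(self.n_waypoints):
            h = self.gru(wp, h)
            wp = wp + self.to_offset(h)  # cumulative (x, y) offsets in ego frame
            waypoints.append(wp)
        return torch.stack(waypoints, dim=1)  # (B, n_waypoints, 2)

# Smoke test with dummy inputs (BEV map assumed rendered to 3 channels here).
model = FusionWaypointModel()
left, center, right, bev = (torch.randn(2, 3, 128, 128) for _ in range(4))
print(model(left, center, right, bev).shape)  # torch.Size([2, 4, 2])
```

The cumulative-offset decoding mirrors a common design choice in end-to-end driving models: predicting each waypoint as a displacement from the previous one tends to be easier to learn than regressing absolute coordinates.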
