Paper Title
Safe Reinforcement Learning via Curriculum Induction
Paper Authors
Paper Abstract
In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after training but also while learning. To achieve this, existing safe reinforcement learning methods make the agent rely on priors that let it avoid dangerous situations during exploration with high probability, but both the probabilistic guarantees and the smoothness assumptions inherent in these priors are not viable in many scenarios of interest, such as autonomous driving. This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor that saves the agent from violating constraints during learning. In this model, we introduce a monitor that neither needs to know how to do well at the task the agent is learning nor how the environment works. Instead, it has a library of reset controllers that it activates when the agent starts behaving dangerously, preventing it from doing damage. Crucially, the choice of which reset controller to apply in which situation affects the speed of the agent's learning. Based on observing the agent's progress, the teacher itself learns a policy for choosing the reset controllers, a curriculum, that optimizes the agent's final policy reward. Our experiments use this framework in two environments to induce curricula for safe and efficient learning.
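To make the moving parts concrete, below is a minimal, runnable Python sketch of the kind of teacher-student loop the abstract describes: a monitor intervenes with a reset controller when the student becomes unsafe, and a bandit-style teacher learns which controller from its library to activate based on the student's observed progress. The `Corridor` environment, the tabular Q-learning `Student`, and the epsilon-greedy `Teacher` are all illustrative assumptions for this sketch, not the paper's actual algorithm or benchmark environments.

```python
import random
from collections import defaultdict

# Illustrative sketch only: a toy environment, student, and teacher standing
# in for the abstract's monitor / reset-controller / curriculum framework.

class Corridor:
    """1D corridor: state 0 is 'dangerous', state n is the goal."""
    def __init__(self, n=10):
        self.n, self.state = n, 1

    def reset(self, state=1):
        self.state = state
        return self.state

    def step(self, action):                  # action is -1 or +1
        self.state += action
        done = self.state >= self.n
        return self.state, (1.0 if done else -0.01), done

class Student:
    """Tabular Q-learning agent; it knows nothing about safety."""
    def __init__(self, eps=0.2, alpha=0.5, gamma=0.95):
        self.q = defaultdict(float)
        self.eps, self.alpha, self.gamma = eps, alpha, gamma

    def act(self, s):
        if random.random() < self.eps:
            return random.choice((-1, 1))
        return max((-1, 1), key=lambda a: self.q[(s, a)])

    def learn(self, s, a, r, s2):
        target = r + self.gamma * max(self.q[(s2, -1)], self.q[(s2, 1)])
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

class Teacher:
    """Epsilon-greedy bandit over the reset-controller library."""
    def __init__(self, controllers, eps=0.3):
        self.values = {c: 0.0 for c in controllers}
        self.counts = {c: 0 for c in controllers}
        self.eps = eps

    def choose(self):
        if random.random() < self.eps:
            return random.choice(list(self.values))
        return max(self.values, key=self.values.get)

    def update(self, c, progress):            # running average of progress
        self.counts[c] += 1
        self.values[c] += (progress - self.values[c]) / self.counts[c]

def run_phase(env, student, reset_state, steps=200):
    """One training phase under a fixed reset controller; the student's
    total reward serves as the teacher's progress signal."""
    total, s = 0.0, env.reset(reset_state)
    for _ in range(steps):
        a = student.act(s)
        s2, r, done = env.step(a)
        if s2 <= 0:                           # monitor detects danger ...
            s2 = env.reset(reset_state)       # ... and resets before damage
        student.learn(s, a, r, s2)
        total += r
        s = env.reset(reset_state) if done else s2
    return total

env, student = Corridor(), Student()
teacher = Teacher(controllers=(1, 5))         # reset near the start, or midway
for phase in range(50):
    c = teacher.choose()                      # curriculum decision
    teacher.update(c, run_phase(env, student, c))
print("teacher's value estimate per reset controller:", teacher.values)
```

In this toy version the teacher's only signal is the reward the student collects under each reset controller; in the abstract's framing, the teacher instead optimizes the agent's final policy reward, and the reset controllers can be arbitrary recovery behaviors rather than fixed respawn states.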