Runtime-Safety-Guided Policy Repair

Paper Title

Runtime-Safety-Guided Policy Repair

Authors

Weichao Zhou, Ruihan Gao, BaekGyu Kim, Eunsuk Kang, Wenchao Li

Abstract

We study the problem of policy repair for learning-based control policies in safety-critical settings. We consider an architecture where a high-performance learning-based control policy (e.g. one trained as a neural network) is paired with a model-based safety controller. The safety controller is endowed with the abilities to predict whether the trained policy will lead the system to an unsafe state, and take over control when necessary. While this architecture can provide added safety assurances, intermittent and frequent switching between the trained policy and the safety controller can result in undesirable behaviors and reduced performance. We propose to reduce or even eliminate control switching by 'repairing' the trained policy based on runtime data produced by the safety controller in a way that deviates minimally from the original policy. The key idea behind our approach is the formulation of a trajectory optimization problem that allows the joint reasoning of policy update and safety constraints. Experimental results demonstrate that our approach is effective even when the system model in the safety controller is unknown and only approximated.
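
The switching architecture described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the one-dimensional dynamics, the constants, and the names `step`, `learned_policy`, `safe_action`, `predicts_unsafe`, and `run_episode` are all hypothetical. The safety controller rolls a simple model forward under the learned policy and takes over whenever that rollout leaves the safe set; the episode counts how often this happens.

```python
from typing import Callable, Tuple

# All constants and dynamics below are illustrative assumptions, not from the paper.
DT = 0.1        # integration step
X_LIMIT = 1.0   # states with |x| > X_LIMIT are treated as unsafe
HORIZON = 5     # lookahead steps used by the safety controller


def step(x: float, u: float) -> float:
    """Toy single-integrator model shared by the plant and the safety controller."""
    return x + DT * u


def learned_policy(x: float) -> float:
    """Stand-in for a trained policy: always pushes toward the unsafe boundary."""
    return 1.0


def safe_action(x: float) -> float:
    """Stand-in safety action: steer back toward the origin."""
    return -x


def predicts_unsafe(x: float, policy: Callable[[float], float],
                    horizon: int = HORIZON) -> bool:
    """Roll the model forward under the policy and report any unsafe state."""
    for _ in range(horizon):
        x = step(x, policy(x))
        if abs(x) > X_LIMIT:
            return True
    return False


def run_episode(x0: float, policy: Callable[[float], float],
                steps: int = 50) -> Tuple[int, float]:
    """Return (number of safety takeovers, largest |x| visited)."""
    x, interventions, max_abs = x0, 0, abs(x0)
    for _ in range(steps):
        if predicts_unsafe(x, policy):
            u = safe_action(x)      # safety controller takes over
            interventions += 1
        else:
            u = policy(x)           # learned policy stays in control
        x = step(x, u)
        max_abs = max(max_abs, abs(x))
    return interventions, max_abs


if __name__ == "__main__":
    print(run_episode(0.0, learned_policy))
    # A hand-written policy that settles inside the safe set, standing in for
    # a repaired policy; it is NOT produced by the paper's optimization.
    print(run_episode(0.0, lambda x: 0.4 - x))
```

In this toy setting the aggressive policy triggers repeated takeovers while the state stays within the limit, and the second policy never triggers one. That second policy is written by hand only to show what "no switching" looks like; the paper obtains its repaired policy through trajectory optimization, which this sketch does not implement.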
