Title
Guided Policy Improvement for Satisfying STL Tasks using Funnel Adaptation
Authors
Abstract
We introduce a sampling-based learning method for solving optimal control problems involving task satisfaction constraints for systems with partially known dynamics. The control problems are defined by a cost to be minimized and a task to be satisfied, given in the language of signal temporal logic (STL). The complex nature of possible tasks generally makes them difficult to satisfy through random exploration, which limits the practical feasibility of the learning algorithm. Recent work has shown, however, that using a controller to guide the learning process by leveraging available knowledge of system dynamics to aid task satisfaction is greatly beneficial for improving the sample efficiency of the method. Motivated by these findings, this work introduces a controller derivation framework which naturally leads to computationally efficient controllers capable of offering such guidance during the learning process. The derived controllers aim to satisfy a set of so-called robustness specifications or funnels that are imposed on the temporal evolutions of the atomic propositions composing the STL task. Ideally, these specifications are prescribed in a way such that their satisfaction would lead to satisfaction of the STL task. In practice, however, such ideal funnels are not necessarily known a priori, and the guidance the controller offers depends on their estimates. This issue is hereby addressed by introducing an adaptation scheme for automatically updating the funnels during the learning procedure, thus diminishing the role of their initial, user-specified values. The effectiveness of the resulting learning algorithm is demonstrated by two simulation case studies.
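To make the funnel idea concrete, below is a minimal illustrative sketch, not the paper's actual algorithm: it assumes an exponentially shrinking lower funnel bound on a robustness signal and a hypothetical widening rule that re-encloses the signal when the bound is violated during learning. All function names, the funnel shape, and the adaptation rule are assumptions for illustration only.

```python
import math

def funnel(t, gamma0=2.0, gamma_inf=0.1, decay=1.0):
    """Exponentially shrinking funnel width (illustrative parameters):
    starts at gamma0, decays toward gamma_inf at rate `decay`."""
    return (gamma0 - gamma_inf) * math.exp(-decay * t) + gamma_inf

def satisfies_funnel(rho, t, **kw):
    """Check that the robustness signal rho(t) of an atomic proposition
    stays above the lower funnel bound -funnel(t)."""
    return rho > -funnel(t, **kw)

def adapt_funnel(gamma0, rho, t, gamma_inf=0.1, decay=1.0, margin=0.1):
    """Hypothetical adaptation rule: if the observed robustness violates
    the funnel, enlarge gamma0 just enough to place the bound `margin`
    below the observed value, reducing reliance on the user's initial guess."""
    bound = -funnel(t, gamma0=gamma0, gamma_inf=gamma_inf, decay=decay)
    if rho <= bound:
        # Solve (g - gamma_inf) * exp(-decay * t) + gamma_inf = -rho + margin
        needed = (-rho + margin - gamma_inf) / math.exp(-decay * t) + gamma_inf
        return max(gamma0, needed)
    return gamma0
```

A usage pattern during learning would be to evaluate `satisfies_funnel` on each observed robustness sample and call `adapt_funnel` on violations, so that the funnel the guiding controller tracks is gradually shaped by data rather than fixed a priori.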