Paper Title

Logistic Q-Learning

Paper Authors

Joan Bas-Serrano, Sebastian Curi, Andreas Krause, Gergely Neu

Paper Abstract

We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method introduces a Q-function that enables efficient exact model-free implementation. The main feature of our algorithm (called QREPS) is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error. We provide a practical saddle-point optimization method for minimizing this loss function and provide an error-propagation analysis that relates the quality of the individual updates to the performance of the output policy. Finally, we demonstrate the effectiveness of our method on a range of benchmark problems.
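To make the abstract's central idea concrete, below is a minimal, hypothetical sketch contrasting the widely used squared Bellman error with a log-sum-exp ("logistic") aggregation of TD errors, which is the flavor of convex policy-evaluation loss the abstract refers to. The function names, the temperature parameter eta, and the sample batch are assumptions made for illustration; the exact QREPS objective and its saddle-point optimization are defined in the paper and are not reproduced here.

# Illustrative sketch only, not the exact QREPS loss from the paper.
import numpy as np

def squared_bellman_error(td_errors):
    # Mean squared TD/Bellman error over a batch of transitions.
    return np.mean(td_errors ** 2)

def logistic_bellman_error(td_errors, eta=1.0):
    # Log-sum-exp aggregation of TD errors with temperature eta:
    # (1/eta) * log( mean( exp(eta * delta) ) ).
    # Convex in the TD errors; shown only to illustrate the flavor
    # of a "logistic" alternative to the squared Bellman error.
    n = len(td_errors)
    return (np.log(np.sum(np.exp(eta * td_errors))) - np.log(n)) / eta

# Hypothetical batch of TD errors r + gamma * V(x') - Q(x, a).
td = np.array([0.3, -0.1, 0.5, 0.0])
print(squared_bellman_error(td), logistic_bellman_error(td, eta=2.0))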
