Paper Title
Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
Paper Authors
Paper Abstract
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to the vanilla critic, the meta-critic network is explicitly trained to accelerate the learning process; and compared to existing meta-learning algorithms, the meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic framework is designed for off-policy based learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning leads to improvements in a variety of continuous control environments when combined with contemporary Off-PAC methods DDPG, TD3, and the state-of-the-art SAC.
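
As a rough illustration of the idea in the abstract, the following minimal PyTorch sketch shows an actor trained with the vanilla critic loss plus a meta-learned auxiliary loss, and a meta-critic refreshed online via a one-step lookahead. The network shapes, the plain-SGD inner step, and the names meta_critic, aux_loss, and inner_lr are illustrative assumptions, not the paper's implementation (which integrates with DDPG, TD3, and SAC and uses its own loss parameterisation).

# Minimal sketch (not the authors' code): an Off-PAC actor update augmented by a
# meta-critic that provides an extra, learned loss term. Assumes PyTorch >= 2.0.
import torch
import torch.nn as nn
from torch.func import functional_call

state_dim, action_dim, hidden, inner_lr = 8, 2, 64, 1e-2   # illustrative sizes

actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                      nn.Linear(hidden, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, 1))
meta_critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                            nn.Linear(hidden, 1))            # hypothetical architecture

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
meta_opt = torch.optim.Adam(meta_critic.parameters(), lr=1e-3)

def main_loss(params, states):
    # Vanilla DDPG-style actor loss: maximise Q(s, pi(s)), i.e. minimise -Q.
    actions = functional_call(actor, params, (states,))
    return -critic(torch.cat([states, actions], dim=-1)).mean()

def aux_loss(params, states):
    # Additional actor loss produced by the meta-critic network.
    actions = functional_call(actor, params, (states,))
    return meta_critic(torch.cat([states, actions], dim=-1)).mean()

states = torch.randn(32, state_dim)       # stand-in for a replay-buffer batch
val_states = torch.randn(32, state_dim)   # second batch for the meta-update

# Meta-update: did training the actor with the auxiliary loss help?
params = dict(actor.named_parameters())
inner = main_loss(params, states) + aux_loss(params, states)
grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
updated = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
meta_obj = main_loss(updated, val_states)  # vanilla loss after the aux-assisted step
meta_opt.zero_grad()
meta_obj.backward()                        # second-order gradient reaches meta_critic
meta_opt.step()

# Actual actor update with the combined (vanilla + meta-learned) loss.
actor_opt.zero_grad()
(main_loss(dict(actor.named_parameters()), states)
 + aux_loss(dict(actor.named_parameters()), states)).backward()
actor_opt.step()

The key design point this sketch tries to convey is that the meta-objective evaluates the vanilla critic loss after an auxiliary-assisted actor step, so the gradient flowing back through that step tells the meta-critic whether its learned loss actually accelerated the actor's learning, matching the abstract's claim that the meta-critic is explicitly trained to speed up learning online.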