Paper Title
Online Meta-Critic Learning for Off-Policy Actor-Critic Methods
Paper Authors
Paper Abstract
Off-Policy Actor-Critic (Off-PAC) methods have proven successful in a variety of continuous control tasks. Normally, the critic's action-value function is updated using temporal-difference, and the critic in turn provides a loss for the actor that trains it to take actions with higher expected return. In this paper, we introduce a novel and flexible meta-critic that observes the learning process and meta-learns an additional loss for the actor that accelerates and improves actor-critic learning. Compared to the vanilla critic, the meta-critic network is explicitly trained to accelerate the learning process; and compared to existing meta-learning algorithms, the meta-critic is rapidly learned online for a single task, rather than slowly over a family of tasks. Crucially, our meta-critic framework is designed for off-policy based learners, which currently provide state-of-the-art reinforcement learning sample efficiency. We demonstrate that online meta-critic learning leads to improvements in a variety of continuous control environments when combined with contemporary Off-PAC methods DDPG, TD3, and the state-of-the-art SAC.
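
As a rough illustration of the idea in the abstract, the following minimal PyTorch sketch shows an actor trained with the vanilla critic loss plus a meta-learned auxiliary loss, and a meta-critic refreshed online via a one-step lookahead. The network shapes, the plain-SGD inner step, and the names meta_critic, aux_loss, and inner_lr are illustrative assumptions, not the paper's implementation (which integrates with DDPG, TD3, and SAC and uses its own loss parameterisation).

# Minimal sketch (not the authors' code): an Off-PAC actor update augmented by a
# meta-critic that provides an extra, learned loss term. Assumes PyTorch >= 2.0.
import torch
import torch.nn as nn
from torch.func import functional_call

state_dim, action_dim, hidden, inner_lr = 8, 2, 64, 1e-2   # illustrative sizes

actor = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                      nn.Linear(hidden, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, 1))
meta_critic = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                            nn.Linear(hidden, 1))            # hypothetical architecture

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
meta_opt = torch.optim.Adam(meta_critic.parameters(), lr=1e-3)

def main_loss(params, states):
    # Vanilla DDPG-style actor loss: maximise Q(s, pi(s)), i.e. minimise -Q.
    actions = functional_call(actor, params, (states,))
    return -critic(torch.cat([states, actions], dim=-1)).mean()

def aux_loss(params, states):
    # Additional actor loss produced by the meta-critic network.
    actions = functional_call(actor, params, (states,))
    return meta_critic(torch.cat([states, actions], dim=-1)).mean()

states = torch.randn(32, state_dim)       # stand-in for a replay-buffer batch
val_states = torch.randn(32, state_dim)   # second batch for the meta-update

# Meta-update: did training the actor with the auxiliary loss help?
params = dict(actor.named_parameters())
inner = main_loss(params, states) + aux_loss(params, states)
grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)
updated = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
meta_obj = main_loss(updated, val_states)  # vanilla loss after the aux-assisted step
meta_opt.zero_grad()
meta_obj.backward()                        # second-order gradient reaches meta_critic
meta_opt.step()

# Actual actor update with the combined (vanilla + meta-learned) loss.
actor_opt.zero_grad()
(main_loss(dict(actor.named_parameters()), states)
 + aux_loss(dict(actor.named_parameters()), states)).backward()
actor_opt.step()

The key design point this sketch tries to convey is that the meta-objective evaluates the vanilla critic loss after an auxiliary-assisted actor step, so the gradient flowing back through that step tells the meta-critic whether its learned loss actually accelerated the actor's learning, matching the abstract's claim that the meta-critic is explicitly trained to speed up learning online.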