Paper Title
A Joint Imitation-Reinforcement Learning Framework for Reduced Baseline Regret
Paper Authors
Paper Abstract
In various control task domains, existing controllers provide a baseline level of performance that -- though possibly suboptimal -- should be maintained. Reinforcement learning (RL) algorithms that rely on extensive exploration of the state and action space can be used to optimize a control policy. However, fully exploratory RL algorithms may decrease performance below the baseline level during training. In this paper, we address the problem of online optimization of a control policy while minimizing regret w.r.t. the baseline policy's performance. We present a joint imitation-reinforcement learning framework, denoted JIRL. The learning process in JIRL assumes the availability of a baseline policy and is designed with two objectives in mind: \textbf{(a)} leveraging the baseline's online demonstrations to minimize regret w.r.t. the baseline policy during training, and \textbf{(b)} eventually surpassing the baseline performance. JIRL addresses these objectives by initially learning to imitate the baseline policy and gradually shifting control from the baseline to an RL agent. Experimental results show that JIRL effectively accomplishes these objectives in several continuous action-space domains. The results demonstrate that JIRL is comparable to a state-of-the-art algorithm in its final performance while incurring significantly lower baseline regret during training in all of the presented domains. Moreover, the results show a reduction factor of up to $21$ in baseline regret over a state-of-the-art baseline regret minimization approach.
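The control-shifting idea summarized in the abstract can be pictured with a short sketch. The snippet below is a minimal illustration only, not the paper's actual algorithm: the function names (env_reset, env_step, baseline_action, agent_action, agent_update, imitation_update), the probabilistic mixing of actions, and the fixed decay schedule are all assumptions made for the example; the paper may use a different criterion for transferring control from the baseline to the RL agent.

```python
# Minimal sketch of a JIRL-style training loop: imitate the baseline at first,
# then gradually hand control over to the RL agent. All names and the decay
# schedule are illustrative assumptions, not taken from the paper.
import random
from typing import Callable, List, Tuple


def jirl_style_training_loop(
    env_reset: Callable[[], object],
    env_step: Callable[[object], Tuple[object, float, bool]],
    baseline_action: Callable[[object], object],
    agent_action: Callable[[object], object],
    agent_update: Callable[[List[Tuple]], None],
    imitation_update: Callable[[List[Tuple]], None],
    episodes: int = 100,
    mix_prob: float = 1.0,   # probability of following the baseline (hypothetical)
    decay: float = 0.98,     # per-episode hand-over rate (hypothetical)
) -> None:
    """Run episodes in which control gradually shifts from the baseline to the agent."""
    for _ in range(episodes):
        state = env_reset()
        done = False
        transitions, demos = [], []
        while not done:
            b_act = baseline_action(state)
            # With probability mix_prob the baseline stays in control, which keeps
            # behavior (and hence regret w.r.t. the baseline) close to the baseline's.
            act = b_act if random.random() < mix_prob else agent_action(state)
            next_state, reward, done = env_step(act)
            transitions.append((state, act, reward, next_state, done))
            demos.append((state, b_act))  # baseline's online demonstration
            state = next_state
        imitation_update(demos)       # objective (a): learn to imitate the baseline
        agent_update(transitions)     # objective (b): improve beyond the baseline with RL
        mix_prob *= decay             # hand over more control each episode
```

Because the baseline acts whenever the mixing probability is high, early-training behavior stays close to the baseline, which is what keeps baseline regret small; as control is handed over, the RL update is what allows the agent to eventually surpass the baseline.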