Title

Temporal-adaptive Hierarchical Reinforcement Learning

Authors

Wen-Ji Zhou, Yang Yu

Abstract

Hierarchical reinforcement learning (HRL) helps address large-scale and sparse-reward problems in reinforcement learning. In HRL, the policy model has an internal representation structured in levels. With this structure, the reinforcement learning task is expected to decompose into sub-tasks at the corresponding levels, so that learning can be more efficient. Although it is intuitive that a high-level policy only needs to make macro decisions at a low frequency, the exact frequency is hard to determine. Previous HRL approaches often employ a fixed time-skip strategy or learn a termination condition without taking the context into account, which not only requires manual tuning but also sacrifices some decision granularity. In this paper, we propose the \emph{temporal-adaptive hierarchical policy learning} (TEMPLE) structure, which uses a temporal gate to adaptively control the high-level policy's decision frequency. We train the TEMPLE structure with PPO and test its performance in a range of environments including 2-D rooms, MuJoCo tasks, and Atari games. The results show that the TEMPLE structure improves performance in these environments through sequential adaptive high-level control.
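
The abstract only sketches the mechanism, so the following minimal PyTorch sketch illustrates one way such a temporal gate could operate: at each step a learned gate decides whether the high-level policy issues a fresh macro decision (option) or the previous one is carried over, and the low-level policy acts conditioned on the current option. The Bernoulli gate, the one-hot option conditioning, and all network sizes are illustrative assumptions, not the paper's exact architecture.

```python
# A minimal sketch of temporal-adaptive hierarchical control, assuming a
# per-step Bernoulli gate over high-level re-decision. Not the paper's
# exact design; layer sizes and option encoding are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalGateHRL(nn.Module):
    def __init__(self, obs_dim, n_options, n_actions, hidden=64):
        super().__init__()
        self.n_options = n_options
        # Temporal gate: probability of making a new high-level decision now,
        # conditioned on the state and the currently active option.
        self.gate = nn.Sequential(
            nn.Linear(obs_dim + n_options, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
        # High-level policy: picks an option (macro decision) when gated open.
        self.high = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_options),
        )
        # Low-level policy: primitive action given state and active option.
        self.low = nn.Sequential(
            nn.Linear(obs_dim + n_options, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def act(self, obs, prev_option):
        prev_onehot = F.one_hot(prev_option, self.n_options).float()
        # Sample the gate: 1 means the high-level policy re-decides.
        g = torch.bernoulli(self.gate(torch.cat([obs, prev_onehot], -1)))
        if g.item() > 0.5:
            option = torch.distributions.Categorical(
                logits=self.high(obs)).sample()
        else:
            option = prev_option
        onehot = F.one_hot(option, self.n_options).float()
        action = torch.distributions.Categorical(
            logits=self.low(torch.cat([obs, onehot], -1))).sample()
        return action, option


# Illustrative rollout step with made-up dimensions:
policy = TemporalGateHRL(obs_dim=8, n_options=4, n_actions=3)
obs = torch.randn(8)
option = torch.tensor(0)
action, option = policy.act(obs, option)
```

Under PPO training, one plausible treatment (an assumption here, not detailed in the abstract) is to regard the gate sample, the option, and the primitive action as a joint stochastic policy output, so their log-probabilities all contribute to the clipped surrogate objective.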
