Paper title
Long-Term Visitation Value for Deep Exploration in Sparse Reward Reinforcement Learning
Paper authors
Paper abstract
Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely, the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it will likely stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods which use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment. Source code is available at https://github.com/sparisi/visit-value-explore.
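To make the two key ideas of the abstract concrete, below is a minimal tabular sketch, not the authors' implementation: alongside the usual Q-function, a separate "visitation value" table (here called `W`) is trained with a TD-style update on a count-based intrinsic signal, so that bootstrapping propagates long-term novelty rather than only the immediate visit count, and action selection combines the two values. The names (`W`, `beta`, `gamma_w`), the 1/sqrt(N) intrinsic signal, and the additive combination rule are illustrative assumptions; the exact update and combination follow the paper and the linked source code.

```python
import numpy as np

# Hypothetical tabular setting: small discrete state and action spaces.
n_states, n_actions = 20, 4
alpha, gamma, gamma_w, beta = 0.1, 0.99, 0.99, 1.0  # assumed hyperparameters

Q = np.zeros((n_states, n_actions))  # exploitation value (extrinsic reward)
W = np.zeros((n_states, n_actions))  # exploration value (long-term visitation)
N = np.zeros((n_states, n_actions))  # state-action visit counts

def select_action(s):
    # Act greedily on a combination of exploitation and exploration values,
    # decoupling the two instead of mixing an intrinsic bonus into Q's target.
    return int(np.argmax(Q[s] + beta * W[s]))

def update(s, a, r, s_next):
    N[s, a] += 1
    # Standard off-policy Q-learning update for the extrinsic reward.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    # W is trained like a Q-function, but on a count-based intrinsic signal;
    # bootstrapping lets rarely visited regions far in the future raise W now.
    r_intr = 1.0 / np.sqrt(N[s, a])
    W[s, a] += alpha * (r_intr + gamma_w * np.max(W[s_next]) - W[s, a])
```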