Paper Title
Sampling Efficient Deep Reinforcement Learning through Preference-Guided Stochastic Exploration
Paper Authors
Paper Abstract
A large body of practical work based on the Deep Q-Network (DQN) algorithm has indicated that the stochastic policy, despite its simplicity, is the most frequently used exploration approach. However, most existing stochastic exploration approaches either explore new actions heuristically regardless of Q-values or inevitably introduce bias into the learning process in order to couple the sampling with Q-values. In this paper, we propose a novel preference-guided $ε$-greedy exploration algorithm that can efficiently learn an action distribution in line with the landscape of Q-values for DQN without introducing additional bias. Specifically, we design a dual architecture consisting of two branches, one of which is a copy of DQN, namely the Q-branch. The other branch, which we call the preference branch, learns the action preference that the DQN implicitly follows. We theoretically prove that the policy improvement theorem holds for the preference-guided $ε$-greedy policy and experimentally show that the inferred action preference distribution aligns with the landscape of the corresponding Q-values. Consequently, preference-guided $ε$-greedy exploration motivates the DQN agent to take diverse actions, i.e., actions with larger Q-values can be sampled more frequently, whereas actions with smaller Q-values still have a chance to be explored, thus encouraging exploration. We assess the proposed method with four well-known DQN variants in nine different environments. Extensive results confirm the superiority of our proposed method in terms of performance and convergence speed.
Index Terms: Preference-guided exploration, stochastic policy, data efficiency, deep reinforcement learning, deep Q-learning.
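To make the sampling rule described in the abstract concrete, below is a minimal sketch assuming a PyTorch-style implementation of the dual architecture: the names DualBranchDQN and preference_guided_epsilon_greedy, the hidden-layer size, and the softmax parameterization of the preference branch are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class DualBranchDQN(nn.Module):
    """Illustrative dual-branch network: a Q-branch (a copy of the DQN head)
    and a preference branch producing a distribution over actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_branch = nn.Linear(hidden, n_actions)      # estimates Q(s, a)
        self.pref_branch = nn.Linear(hidden, n_actions)   # action-preference logits

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        q_values = self.q_branch(h)
        preference = torch.softmax(self.pref_branch(h), dim=-1)
        return q_values, preference


def preference_guided_epsilon_greedy(q_values: torch.Tensor,
                                     preference: torch.Tensor,
                                     epsilon: float) -> int:
    """Act greedily w.r.t. Q-values with probability 1 - epsilon; otherwise
    sample the exploratory action from the learned preference distribution
    instead of uniformly at random."""
    if torch.rand(()).item() < epsilon:
        return int(torch.multinomial(preference, num_samples=1).item())
    return int(torch.argmax(q_values).item())


# Usage sketch for a single (unbatched) observation:
# q, p = model(obs)                       # obs: tensor of shape (obs_dim,)
# action = preference_guided_epsilon_greedy(q, p, epsilon=0.1)
```

Under this sketch, exploratory steps (taken with probability ε) draw from the learned preference distribution rather than a uniform distribution, so actions with larger Q-values tend to be sampled more often while low-value actions keep a non-zero probability of being tried, matching the exploration behavior the abstract describes.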