Paper Title

ε-BMC: A Bayesian Ensemble Approach to Epsilon-Greedy Exploration in Model-Free Reinforcement Learning

Authors

Michael Gimelfarb, Scott Sanner, Chi-Guhn Lee

Abstract

Resolving the exploration-exploitation trade-off remains a fundamental problem in the design and implementation of reinforcement learning (RL) algorithms. In this paper, we focus on model-free RL using the epsilon-greedy exploration policy, which despite its simplicity, remains one of the most frequently used forms of exploration. However, a key limitation of this policy is the specification of $\varepsilon$. In this paper, we provide a novel Bayesian perspective of $\varepsilon$ as a measure of the uniformity of the Q-value function. We introduce a closed-form Bayesian model update based on Bayesian model combination (BMC), based on this new perspective, which allows us to adapt $\varepsilon$ using experiences from the environment in constant time with monotone convergence guarantees. We demonstrate that our proposed algorithm, $\varepsilon$-\texttt{BMC}, efficiently balances exploration and exploitation on different problems, performing comparably or outperforming the best tuned fixed annealing schedules and an alternative data-dependent $\varepsilon$ adaptation scheme proposed in the literature.
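To make the idea above concrete, below is a minimal, illustrative sketch, not the paper's exact $\varepsilon$-\texttt{BMC} update: $\varepsilon$ is taken to be the posterior mean of a Beta-distributed mixture weight between a "uniform" value model and a "greedy" value model, and the Beta pseudo-counts are updated by each model's responsibility for an observed TD target. The class name `AdaptiveEpsilonGreedy`, the Gaussian likelihood, and the fixed noise scale `sigma` are simplifying assumptions introduced here for illustration; the paper's closed-form Bayesian model combination update differs in its choice of priors and likelihoods.

```python
import numpy as np

class AdaptiveEpsilonGreedy:
    """Illustrative sketch: epsilon-greedy with a Bayesian-model-combination-style
    adaptation of epsilon (simplified; not the paper's exact derivation)."""

    def __init__(self, n_actions, alpha=1.0, beta=1.0, sigma=1.0):
        self.n_actions = n_actions
        self.alpha = alpha    # pseudo-count for the "uniform" model
        self.beta = beta      # pseudo-count for the "greedy" model
        self.sigma = sigma    # assumed fixed observation noise (hypothetical)

    @property
    def epsilon(self):
        # Posterior mean of the mixture weight plays the role of epsilon.
        return self.alpha / (self.alpha + self.beta)

    def select_action(self, q_values, rng):
        # Standard epsilon-greedy choice using the current adaptive epsilon.
        if rng.random() < self.epsilon:
            return int(rng.integers(self.n_actions))
        return int(np.argmax(q_values))

    def update(self, q_values, target):
        # Each model "predicts" the observed TD target: the uniform model with
        # the mean Q-value, the greedy model with the max Q-value.
        def gauss_likelihood(mu):
            return np.exp(-0.5 * ((target - mu) / self.sigma) ** 2)

        lik_uniform = gauss_likelihood(np.mean(q_values))
        lik_greedy = gauss_likelihood(np.max(q_values))
        total = lik_uniform + lik_greedy + 1e-12

        # Credit each model's pseudo-count by its posterior responsibility,
        # a constant-time, Bayesian-model-combination-style update.
        self.alpha += lik_uniform / total
        self.beta += lik_greedy / total

# Example usage (hypothetical Q-values and target):
rng = np.random.default_rng(0)
agent = AdaptiveEpsilonGreedy(n_actions=4)
q = np.array([0.1, 0.5, 0.2, 0.4])
action = agent.select_action(q, rng)
agent.update(q, target=0.45)
print(agent.epsilon)
```

In this simplified version, epsilon shrinks as observed targets look more like the greedy model's predictions and grows when the Q-values appear closer to uniform, which mirrors the abstract's interpretation of $\varepsilon$ as a measure of Q-value uniformity.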
