Paper Title

ε-BMC: A Bayesian Ensemble Approach to Epsilon-Greedy Exploration in Model-Free Reinforcement Learning

Authors

Michael Gimelfarb, Scott Sanner, Chi-Guhn Lee

Abstract

Resolving the exploration-exploitation trade-off remains a fundamental problem in the design and implementation of reinforcement learning (RL) algorithms. In this paper, we focus on model-free RL using the epsilon-greedy exploration policy, which despite its simplicity, remains one of the most frequently used forms of exploration. However, a key limitation of this policy is the specification of $\varepsilon$. In this paper, we provide a novel Bayesian perspective of $\varepsilon$ as a measure of the uniformity of the Q-value function. We introduce a closed-form Bayesian model update based on Bayesian model combination (BMC), based on this new perspective, which allows us to adapt $\varepsilon$ using experiences from the environment in constant time with monotone convergence guarantees. We demonstrate that our proposed algorithm, $\varepsilon$-\texttt{BMC}, efficiently balances exploration and exploitation on different problems, performing comparably or outperforming the best tuned fixed annealing schedules and an alternative data-dependent $\varepsilon$ adaptation scheme proposed in the literature.
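To make the idea above concrete, below is a minimal, illustrative sketch, not the paper's exact $\varepsilon$-\texttt{BMC} update: $\varepsilon$ is taken to be the posterior mean of a Beta-distributed mixture weight between a "uniform" value model and a "greedy" value model, and the Beta pseudo-counts are updated by each model's responsibility for an observed TD target. The class name `AdaptiveEpsilonGreedy`, the Gaussian likelihood, and the fixed noise scale `sigma` are simplifying assumptions introduced here for illustration; the paper's closed-form Bayesian model combination update differs in its choice of priors and likelihoods.

```python
import numpy as np

class AdaptiveEpsilonGreedy:
    """Illustrative sketch: epsilon-greedy with a Bayesian-model-combination-style
    adaptation of epsilon (simplified; not the paper's exact derivation)."""

    def __init__(self, n_actions, alpha=1.0, beta=1.0, sigma=1.0):
        self.n_actions = n_actions
        self.alpha = alpha    # pseudo-count for the "uniform" model
        self.beta = beta      # pseudo-count for the "greedy" model
        self.sigma = sigma    # assumed fixed observation noise (hypothetical)

    @property
    def epsilon(self):
        # Posterior mean of the mixture weight plays the role of epsilon.
        return self.alpha / (self.alpha + self.beta)

    def select_action(self, q_values, rng):
        # Standard epsilon-greedy choice using the current adaptive epsilon.
        if rng.random() < self.epsilon:
            return int(rng.integers(self.n_actions))
        return int(np.argmax(q_values))

    def update(self, q_values, target):
        # Each model "predicts" the observed TD target: the uniform model with
        # the mean Q-value, the greedy model with the max Q-value.
        def gauss_likelihood(mu):
            return np.exp(-0.5 * ((target - mu) / self.sigma) ** 2)

        lik_uniform = gauss_likelihood(np.mean(q_values))
        lik_greedy = gauss_likelihood(np.max(q_values))
        total = lik_uniform + lik_greedy + 1e-12

        # Credit each model's pseudo-count by its posterior responsibility,
        # a constant-time, Bayesian-model-combination-style update.
        self.alpha += lik_uniform / total
        self.beta += lik_greedy / total

# Example usage (hypothetical Q-values and target):
rng = np.random.default_rng(0)
agent = AdaptiveEpsilonGreedy(n_actions=4)
q = np.array([0.1, 0.5, 0.2, 0.4])
action = agent.select_action(q, rng)
agent.update(q, target=0.45)
print(agent.epsilon)
```

In this simplified version, epsilon shrinks as observed targets look more like the greedy model's predictions and grows when the Q-values appear closer to uniform, which mirrors the abstract's interpretation of $\varepsilon$ as a measure of Q-value uniformity.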
