Paper Title
Regret Bounds for Information-Directed Reinforcement Learning
Paper Authors
Paper Abstract
Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for reinforcement learning (RL). However, theoretical understanding of IDS for Markov Decision Processes (MDPs) is still limited. We develop novel information-theoretic tools to bound the information ratio and the cumulative information gain about the learning target. Our theoretical results shed light on the importance of choosing the learning target, so that practitioners can balance computation and regret bounds. As a consequence, we derive prior-free Bayesian regret bounds for vanilla-IDS, which learns the whole environment, under tabular finite-horizon MDPs. In addition, we propose a computationally efficient regularized-IDS that maximizes an additive form rather than the ratio form, and show that it enjoys the same regret bound as vanilla-IDS. With the aid of rate-distortion theory, we improve the regret bound by learning a surrogate, less informative environment. Furthermore, we extend our analysis to linear MDPs and prove similar regret bounds for Thompson sampling as a by-product.
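To make the two objectives mentioned in the abstract concrete, below is a minimal sketch of the acquisition rules in standard IDS notation; the symbols $\Delta_t(\pi)$ (expected per-episode regret of policy $\pi$), $\mathbb{I}_t(\pi;\chi)$ (information gained about the learning target $\chi$ by playing $\pi$), $V^{\pi}$ (value of $\pi$), and the tuning parameter $\lambda>0$ are assumed here for illustration and need not match the paper's exact notation.

\[
\text{vanilla-IDS:}\quad \pi_t \in \arg\min_{\pi}\; \frac{\big(\mathbb{E}_t[\Delta_t(\pi)]\big)^{2}}{\mathbb{I}_t(\pi;\chi)},
\qquad
\text{regularized-IDS:}\quad \pi_t \in \arg\max_{\pi}\; \Big\{\, \mathbb{E}_t\big[V^{\pi}\big] + \lambda\,\mathbb{I}_t(\pi;\chi) \,\Big\}.
\]

The additive objective avoids optimizing a ratio of two quantities, which is the source of the computational advantage claimed for regularized-IDS while, per the abstract, retaining the same regret bound as vanilla-IDS.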