Leveraging Offline Data in Online Reinforcement Learning

Paper Title

Leveraging Offline Data in Online Reinforcement Learning

Authors

Andrew Wagenmaker, Aldo Pacchiano

Abstract

Two central paradigms have emerged in the reinforcement learning (RL) community: online RL and offline RL. In the online RL setting, the agent has no prior knowledge of the environment and must interact with it in order to find an $ε$-optimal policy. In the offline RL setting, the learner instead has access to a fixed dataset to learn from, but cannot otherwise interact with the environment, and must obtain the best policy it can from this offline data. Practical scenarios often motivate an intermediate setting: if we have some set of offline data and may also interact with the environment, how can we best use the offline data to minimize the number of online interactions necessary to learn an $ε$-optimal policy? In this work, we consider this setting, which we call the \textsf{FineTuneRL} setting, for MDPs with linear structure. We characterize the number of online samples needed in this setting given access to some offline dataset, and develop an algorithm, \textsc{FTPedel}, which is provably optimal up to $H$ factors. We show through an explicit example that combining offline data with online interactions can lead to a provable improvement over either purely offline or purely online RL. Finally, our results illustrate the distinction between \emph{verifiable} learning, the typical setting considered in online RL, and \emph{unverifiable} learning, the setting often considered in offline RL, and show that there is a formal separation between these regimes.
