Paper Title
Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints
Paper Authors
Paper Abstract
Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. Both theoretically and empirically, we show that typical offline RL methods, which are based on distribution constraints, fail to learn from data with such non-uniform variability, due to the requirement to stay close to the behavior policy to the same extent across the state space. Ideally, the learned policy should be free to choose per state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning (CQL) to obtain an approximate support constraint formulation. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method, CQL (ReDS), is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
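
To make the reweighting idea in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released implementation, of a CQL-style penalty whose "push-down" distribution is a mixture of actions sampled from the current policy and actions from an auxiliary policy that mines poor actions likely under the behavior policy, while dataset actions are pushed up as in standard CQL. All names (QNetwork, reds_style_penalty) and the 50/50 mixture weight are illustrative assumptions; how the mining policy is trained and how this penalty is combined with the TD loss follow the paper and are not shown here.

```python
# Illustrative sketch of a CQL penalty with a reweighted push-down distribution,
# assuming continuous actions and pre-sampled action batches. Not the paper's code.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Small Q(s, a) network for continuous actions (illustrative only)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)


def reds_style_penalty(q_net: QNetwork,
                       states: torch.Tensor,
                       dataset_actions: torch.Tensor,
                       policy_actions: torch.Tensor,
                       mined_actions: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """CQL-style regularizer where Q-values are pushed down under a 50/50
    mixture of current-policy actions and 'mined' poor actions, and pushed
    up under dataset actions (as in standard CQL)."""
    q_policy = q_net(states, policy_actions)   # actions from the learned policy
    q_mined = q_net(states, mined_actions)     # actions from the auxiliary mining policy
    q_data = q_net(states, dataset_actions)    # actions taken in the dataset
    push_down = 0.5 * q_policy.mean() + 0.5 * q_mined.mean()
    push_up = q_data.mean()
    return alpha * (push_down - push_up)


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, state_dim, action_dim = 32, 17, 6
    q = QNetwork(state_dim, action_dim)
    s = torch.randn(batch, state_dim)
    a_data = torch.randn(batch, action_dim)   # stand-ins for dataset actions
    a_pi = torch.randn(batch, action_dim)     # stand-ins for current-policy samples
    a_mine = torch.randn(batch, action_dim)   # stand-ins for mined poor actions
    penalty = reds_style_penalty(q, s, a_data, a_pi, a_mine)
    print(float(penalty))                      # added to the usual TD loss in training
```

The key design point the sketch illustrates is that, unlike vanilla CQL, the conservatism is concentrated on actions that are both likely under the behavior policy and poor according to the current Q-function, which is what allows the learned policy to deviate per state while remaining within the behavior policy's support.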