Paper Title
Dual Generator Offline Reinforcement Learning
Paper Authors
Paper Abstract
In offline RL, constraining the learned policy to remain close to the data is essential to prevent the policy from outputting out-of-distribution (OOD) actions with erroneously overestimated values. In principle, generative adversarial networks (GANs) can provide an elegant solution to do so, with the discriminator directly providing a probability that quantifies distributional shift. However, in practice, GAN-based offline RL methods have not performed as well as alternative approaches, perhaps because the generator is trained to both fool the discriminator and maximize return -- two objectives that can be at odds with each other. In this paper, we show that the issue of conflicting objectives can be resolved by training two generators: one that maximizes return, with the other capturing the ``remainder'' of the data distribution in the offline dataset, such that the mixture of the two is close to the behavior policy. We show that having two generators not only enables an effective GAN-based offline RL method, but also approximates a support constraint, where the policy does not need to match the entire data distribution, but only the slice of the data that leads to high long-term performance. We name our method DASCO, for Dual-Generator Adversarial Support Constrained Offline RL. On benchmark tasks that require learning from sub-optimal data, DASCO significantly outperforms prior methods that enforce distribution constraints.
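As a rough illustration of the dual-generator idea described above (a sketch, not the paper's exact losses), one can write the objectives with a return-maximizing policy $\pi_\theta$, an auxiliary generator $g_\phi$, a discriminator $D_\psi$, the offline dataset $\mathcal{D}$, and a learned value function $Q$; the equal-weight mixture and the specific loss forms below are illustrative assumptions:

% Hypothetical sketch of a dual-generator adversarial objective.
% The mixture of the two generators, not the policy alone, is what must
% match the data distribution, so the policy can occupy only the
% high-return slice of the dataset's support.
\begin{align*}
p_{\mathrm{mix}}(a \mid s) &= \tfrac{1}{2}\,\pi_\theta(a \mid s) + \tfrac{1}{2}\,g_\phi(a \mid s), \\
\max_{\psi}\;& \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\log D_\psi(s,a)\right]
  + \mathbb{E}_{s\sim\mathcal{D},\,a\sim p_{\mathrm{mix}}(\cdot\mid s)}\!\left[\log\!\big(1 - D_\psi(s,a)\big)\right], \\
\max_{\theta}\;& \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta(\cdot\mid s)}\!\left[Q(s,a)\right]
  \quad\text{(policy maximizes return)}, \\
\max_{\phi}\;& \mathbb{E}_{s\sim\mathcal{D},\,a\sim p_{\mathrm{mix}}(\cdot\mid s)}\!\left[\log D_\psi(s,a)\right]
  \quad\text{(auxiliary generator absorbs the ``remainder'' so the mixture fools the discriminator)}.
\end{align*}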