Paper Title

ConQUR: Mitigating Delusional Bias in Deep Q-learning

Authors

Andy Su, Jayden Ooi, Tyler Lu, Dale Schuurmans, Craig Boutilier

Abstract

Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.
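
The abstract names a penalization scheme but gives no formula. The following is a minimal sketch of one way such a consistency penalty could look, not the authors' implementation. It assumes each Q-label was built by treating some action a' as greedy at the next state s', and it penalizes the current approximator whenever another action scores higher there. The hinge form, the function names, and the weight `lam` are all assumptions made for illustration.

```python
import numpy as np

def consistency_penalty(q_next, label_actions):
    """Per-sample hinge penalty (assumed form, not taken from the abstract).

    q_next:        (batch, n_actions) current Q-values at the next states s'
    label_actions: (batch,) action a' each Q-label assumed to be greedy at s'
    """
    q_next = np.asarray(q_next, dtype=float)
    label_actions = np.asarray(label_actions)
    # Q(s', a') for the action each label committed to
    chosen = q_next[np.arange(len(label_actions)), label_actions]
    # [Q(s', a) - Q(s', a')]_+ : positive only where another action beats a'
    gaps = np.maximum(q_next - chosen[:, None], 0.0)
    return gaps.sum(axis=1)

def penalized_loss(q_pred, targets, q_next, label_actions, lam=0.5):
    """Mean squared TD error plus lam times the mean consistency penalty."""
    td = np.mean((np.asarray(targets, dtype=float) - np.asarray(q_pred, dtype=float)) ** 2)
    return td + lam * np.mean(consistency_penalty(q_next, label_actions))

# Hypothetical batch of two transitions with three actions each
q_next = [[1.0, 2.0, 0.5], [3.0, 1.0, 2.0]]
label_actions = [0, 0]
print(consistency_penalty(q_next, label_actions))
print(penalized_loss([1.0, 2.0], [1.0, 2.0], q_next, label_actions))
```

In this sketch the first sample is penalized because its label assumed action 0 was greedy while the approximator currently prefers action 1; the second sample agrees with its label and contributes nothing.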
