ConQUR: Mitigating Delusional Bias in Deep Q-learning

Paper title

ConQUR: Mitigating Delusional Bias in Deep Q-learning

Authors

Su, Andy; Ooi, Jayden; Lu, Tyler; Schuurmans, Dale; Boutilier, Craig

Abstract

Delusional bias is a fundamental source of error in approximate Q-learning. To date, the only techniques that explicitly address delusion require comprehensive search using tabular value estimates. In this paper, we develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class. We introduce a simple penalization scheme that encourages Q-labels used across training batches to remain (jointly) consistent with the expressible policy class. We also propose a search framework that allows multiple Q-approximators to be generated and tracked, thus mitigating the effect of premature (implicit) policy commitments. Experimental results demonstrate that these methods can improve the performance of Q-learning in a variety of Atari games, sometimes dramatically.

Paper title