Paper title
Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics
Paper authors
Paper abstract
Overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate overestimation bias in a continuous control setting. Our method, Truncated Quantile Critics (TQC), blends three ideas: a distributional representation of the critic, truncation of the critics' predictions, and ensembling of multiple critics. Distributional representation and truncation allow for arbitrarily granular control of overestimation, while ensembling provides additional score improvements. TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating a 25% improvement on the most challenging Humanoid environment.
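The truncation idea in the abstract can be illustrated with a small sketch. It assumes that each critic outputs a list of quantile atoms, that the atoms of all critics are pooled into one mixture, and that truncation means discarding the largest pooled atoms. The function name `truncated_mixture`, the parameter `drop_per_critic`, and the toy numbers are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
def truncated_mixture(critic_atoms, drop_per_critic):
    """Pool the atoms of all critics, sort them, and drop the largest ones.

    critic_atoms: list of lists, one list of quantile atoms per critic
                  (hypothetical input format).
    drop_per_critic: number of atoms removed per critic; the total number
                     of removed atoms is drop_per_critic * number of critics.
    """
    # Ensembling: merge the atoms of every critic into a single mixture.
    pooled = sorted(atom for atoms in critic_atoms for atom in atoms)
    n_drop = drop_per_critic * len(critic_atoms)
    if n_drop >= len(pooled):
        raise ValueError("truncation would remove every atom")
    # Truncation: keep only the smallest atoms of the sorted mixture.
    return pooled[:len(pooled) - n_drop]


# Toy example (made-up numbers): 2 critics with 3 atoms each.
critic_atoms = [[1.0, 2.0, 6.0], [0.5, 2.5, 9.0]]
kept = truncated_mixture(critic_atoms, drop_per_critic=1)

n_atoms = sum(len(atoms) for atoms in critic_atoms)
mean_all = sum(sum(atoms) for atoms in critic_atoms) / n_atoms
mean_kept = sum(kept) / len(kept)
```

In this sketch, changing `drop_per_critic` by one shifts the value estimate by a small step, which is one way to read the abstract's claim of granular overestimation control: the more atoms are dropped from the right tail, the more pessimistic the estimate becomes.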