Paper Title
Marginalized Operators for Off-policy Reinforcement Learning
Paper Authors
Paper Abstract
In this work, we propose marginalized operators, a new class of off-policy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multi-step operators, such as Retrace, as special cases. Marginalized operators also suggest a form of sample-based estimates with potential variance reduction, compared to sample-based estimates of the original multi-step operators. We show that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases. Finally, we empirically demonstrate that marginalized operators provide performance gains to off-policy evaluation and downstream policy optimization algorithms.
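For context, a minimal sketch in standard off-policy evaluation notation of the two special cases the abstract names; the symbols below (behavior policy \( \mu \), target policy \( \pi \), trace coefficients \( c_s \), discounted visitation distributions \( d_\pi, d_\mu \)) are assumptions of this sketch and not the paper's own definitions. The Retrace(\( \lambda \)) operator of Munos et al. (2016), which the abstract states is recovered as a special case, evaluates \( \pi \) from trajectories generated by \( \mu \) via

\[
\mathcal{R} Q(x, a) = Q(x, a) + \mathbb{E}_{\mu}\!\left[ \sum_{t \ge 0} \gamma^{t} \Big( \prod_{s=1}^{t} c_{s} \Big) \big( r_{t} + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t}) \big) \right],
\qquad
c_{s} = \lambda \min\!\left( 1, \frac{\pi(a_{s} \mid x_{s})}{\mu(a_{s} \mid x_{s})} \right).
\]

Marginalized importance sampling, which the abstract says is likewise generalized by the scalable estimates, replaces the per-step product of ratios above with a single marginal density ratio such as \( w(x, a) = d_{\pi}(x, a) / d_{\mu}(x, a) \); avoiding the product of ratios is the usual mechanism behind the potential variance reduction mentioned in the abstract.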