Paper Title
Marginalized Operators for Off-policy Reinforcement Learning
Paper Authors
Paper Abstract
In this work, we propose marginalized operators, a new class of off-policy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multi-step operators, such as Retrace, as special cases. Marginalized operators also suggest a form of sample-based estimates with potential variance reduction, compared to sample-based estimates of the original multi-step operators. We show that the estimates for marginalized operators can be computed in a scalable way, which also generalizes prior results on marginalized importance sampling as special cases. Finally, we empirically demonstrate that marginalized operators provide performance gains to off-policy evaluation and downstream policy optimization algorithms.
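For context, a minimal sketch in standard off-policy evaluation notation of the two special cases the abstract names; the symbols below (behavior policy \( \mu \), target policy \( \pi \), trace coefficients \( c_s \), discounted visitation distributions \( d_\pi, d_\mu \)) are assumptions of this sketch and not the paper's own definitions. The Retrace(\( \lambda \)) operator of Munos et al. (2016), which the abstract states is recovered as a special case, evaluates \( \pi \) from trajectories generated by \( \mu \) via

\[
\mathcal{R} Q(x, a) = Q(x, a) + \mathbb{E}_{\mu}\!\left[ \sum_{t \ge 0} \gamma^{t} \Big( \prod_{s=1}^{t} c_{s} \Big) \big( r_{t} + \gamma\, \mathbb{E}_{\pi} Q(x_{t+1}, \cdot) - Q(x_{t}, a_{t}) \big) \right],
\qquad
c_{s} = \lambda \min\!\left( 1, \frac{\pi(a_{s} \mid x_{s})}{\mu(a_{s} \mid x_{s})} \right).
\]

Marginalized importance sampling, which the abstract says is likewise generalized by the scalable estimates, replaces the per-step product of ratios above with a single marginal density ratio such as \( w(x, a) = d_{\pi}(x, a) / d_{\mu}(x, a) \); avoiding the product of ratios is the usual mechanism behind the potential variance reduction mentioned in the abstract.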