Paper Title

A maximum-entropy approach to off-policy evaluation in average-reward MDPs

Authors

Nevena Lazic, Dong Yin, Mehrdad Farajtabar, Nir Levine, Dilan Gorur, Chris Harris, Dale Schuurmans

Abstract

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e. where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
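
The maximum-entropy formulation mentioned in the abstract can be sketched as a standard moment-matching program; the notation below (a state–action distribution d, known features φ, and a constraint vector c built from the empirical feature dynamics) is illustrative and not taken from the paper itself:

\[
\max_{d \ge 0,\; \sum_{s,a} d(s,a) = 1} \; H(d)
\quad \text{subject to} \quad
\sum_{s,a} d(s,a)\, \phi(s,a) = c,
\]

where, in the setting described above, the constraint encodes that the feature expectations under d are consistent with one step of the estimated (approximately linear) feature dynamics. The solution of such a program is an exponential-family distribution whose sufficient statistics are the constrained features,

\[
d_{\lambda}(s,a) \;\propto\; \exp\!\big(\lambda^{\top} \phi(s,a)\big),
\]

with the dual variables \(\lambda\) chosen so that the moment constraints hold. This is consistent with the abstract's statement that the estimated stationary distribution is exponential-family with the features as sufficient statistics, paralleling maximum-entropy methods in supervised learning.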
