Paper Title

A maximum-entropy approach to off-policy evaluation in average-reward MDPs

Authors

Nevena Lazic, Dong Yin, Mehrdad Farajtabar, Nir Levine, Dilan Gorur, Chris Harris, Dale Schuurmans

Abstract

This work focuses on off-policy evaluation (OPE) with function approximation in infinite-horizon undiscounted Markov decision processes (MDPs). For MDPs that are ergodic and linear (i.e. where rewards and dynamics are linear in some known features), we provide the first finite-sample OPE error bound, extending existing results beyond the episodic and discounted cases. In a more general setting, when the feature dynamics are approximately linear and for arbitrary rewards, we propose a new approach for estimating stationary distributions with function approximation. We formulate this problem as finding the maximum-entropy distribution subject to matching feature expectations under empirical dynamics. We show that this results in an exponential-family distribution whose sufficient statistics are the features, paralleling maximum-entropy approaches in supervised learning. We demonstrate the effectiveness of the proposed OPE approaches in multiple environments.
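
The maximum-entropy formulation mentioned in the abstract can be sketched as a standard moment-matching program; the notation below (a state–action distribution d, known features φ, and a constraint vector c built from the empirical feature dynamics) is illustrative and not taken from the paper itself:

\[
\max_{d \ge 0,\; \sum_{s,a} d(s,a) = 1} \; H(d)
\quad \text{subject to} \quad
\sum_{s,a} d(s,a)\, \phi(s,a) = c,
\]

where, in the setting described above, the constraint encodes that the feature expectations under d are consistent with one step of the estimated (approximately linear) feature dynamics. The solution of such a program is an exponential-family distribution whose sufficient statistics are the constrained features,

\[
d_{\lambda}(s,a) \;\propto\; \exp\!\big(\lambda^{\top} \phi(s,a)\big),
\]

with the dual variables \(\lambda\) chosen so that the moment constraints hold. This is consistent with the abstract's statement that the estimated stationary distribution is exponential-family with the features as sufficient statistics, paralleling maximum-entropy methods in supervised learning.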
