Paper Title
An Instrumental Variable Approach to Confounded Off-Policy Evaluation
Paper Authors
Paper Abstract
Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.
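To make the single-stage analogy mentioned in the abstract concrete, the following is a minimal sketch of how an instrument can recover an action's effect on reward when an unmeasured variable confounds the two. It is not the paper's estimator and does not cover the infinite-horizon MDP setting. The data-generating process, the binary instrument, the constant linear effect, and every name in the code (`simulate`, `naive_effect`, `wald_effect`) are assumptions made purely for illustration; the estimator shown is the textbook Wald ratio.

```python
import numpy as np

def simulate(n, effect=1.0, seed=0):
    """Hypothetical single-stage data with an unmeasured confounder u."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                  # unmeasured confounder
    z = rng.integers(0, 2, size=n)          # binary instrument, independent of u
    # The action depends on both the instrument and the confounder.
    p = 1.0 / (1.0 + np.exp(-(1.5 * (2 * z - 1) + u)))
    a = rng.binomial(1, p)
    # The reward depends on the action and the confounder, but not directly on z.
    r = effect * a + 2.0 * u + rng.normal(size=n)
    return z, a, r

def naive_effect(a, r):
    """Difference in mean reward between actions; biased under confounding."""
    return r[a == 1].mean() - r[a == 0].mean()

def wald_effect(z, a, r):
    """Wald (ratio) IV estimate: instrument's effect on reward over its effect on action."""
    num = r[z == 1].mean() - r[z == 0].mean()
    den = a[z == 1].mean() - a[z == 0].mean()
    return num / den

z, a, r = simulate(200_000)
naive = naive_effect(a, r)
iv = wald_effect(z, a, r)
print(f"naive estimate: {naive:.3f}")
print(f"IV (Wald) estimate: {iv:.3f}")
```

Under these assumptions the naive contrast absorbs the confounder's influence and overstates the effect, while the Wald ratio stays close to the simulated effect of 1.0, because the instrument shifts the action without being related to the confounder or affecting the reward directly.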