Title

Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies

Authors

Jinlin Lai, Lixin Zou, Jiaxing Song

Abstract

Off-policy evaluation is a key component of reinforcement learning that evaluates a target policy with offline data collected from behavior policies. It is a crucial step towards safe reinforcement learning and has been used in advertising, recommender systems, and many other applications. In these applications, the offline data is sometimes collected from multiple behavior policies. Previous works treat data from different behavior policies equally, yet some behavior policies are better than others at producing good estimators. This paper begins by discussing how to correctly mix the estimators produced by different behavior policies. We propose three ways to reduce the variance of the mixture estimator when all sub-estimators are unbiased or asymptotically unbiased. Furthermore, experiments on simulated recommender systems show that our methods are effective in reducing the mean squared error of estimation.
