Paper Title

Policy Gradient Optimization of Thompson Sampling Policies

Paper Authors

Seungki Min, Ciamac C. Moallemi, Daniel J. Russo

Abstract

We study the use of policy gradient algorithms to optimize over a class of generalized Thompson sampling policies. Our central insight is to view the posterior parameter sampled by Thompson sampling as a kind of pseudo-action. Policy gradient methods can then be tractably applied to search over a class of sampling policies, which determine a probability distribution over pseudo-actions (i.e., sampled parameters) as a function of observed data. We also propose and compare policy gradient estimators that are specialized to Bayesian bandit problems. Numerical experiments demonstrate that direct policy search on top of Thompson sampling automatically corrects for some of the algorithm's known shortcomings and offers meaningful improvements even in long horizon problems where standard Thompson sampling is extremely effective.
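
To make the abstract's central idea concrete, below is a minimal, hypothetical sketch of a generalized Thompson sampling policy for a K-armed Bernoulli bandit, written as a plain REINFORCE-style policy gradient rather than the paper's specialized estimators. The sampled posterior parameters are treated as the pseudo-action, and the sampling policy is tilted by a single learnable scalar `c` (a posterior-count inflation factor of our own choosing, where `c = 1` recovers standard Thompson sampling). All names (`c`, `run_episode`, the tilting scheme) are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)

K, T = 2, 100                    # arms, horizon
BATCH, ITERS, LR = 64, 200, 0.05 # batch of episodes, gradient steps, step size

def run_episode(c):
    """One bandit episode under a tilted-posterior Thompson sampling policy.

    Pseudo-action at each step: the vector of sampled posterior parameters.
    Tilting: theta_k ~ Beta(1 + c*(alpha_k - 1), 1 + c*(beta_k - 1)),
    so c = 1 is standard Thompson sampling. Returns the total reward and
    d/dc of the episode's log-likelihood (the score function).
    """
    mu = rng.uniform(size=K)     # true arm means, drawn from the uniform prior
    alpha = np.ones(K)           # Beta posterior parameters (successes + 1)
    beta = np.ones(K)            # Beta posterior parameters (failures + 1)
    total, score = 0.0, 0.0
    for _ in range(T):
        a = 1.0 + c * (alpha - 1.0)   # tilted posterior parameters
        b = 1.0 + c * (beta - 1.0)
        theta = rng.beta(a, b)        # sampled pseudo-action
        # d/dc log Beta(theta; a(c), b(c)), using a'(c) = alpha-1, b'(c) = beta-1
        da, db = alpha - 1.0, beta - 1.0
        score += np.sum(
            da * np.log(theta) + db * np.log(1.0 - theta)
            - (da * digamma(a) + db * digamma(b) - (da + db) * digamma(a + b))
        )
        arm = int(np.argmax(theta))        # act greedily on the pseudo-action
        r = float(rng.random() < mu[arm])  # Bernoulli reward
        total += r
        alpha[arm] += r                    # conjugate posterior update
        beta[arm] += 1.0 - r
    return total, score

c = 1.0                                    # start from standard Thompson sampling
for it in range(ITERS):
    rewards, scores = zip(*(run_episode(c) for _ in range(BATCH)))
    rewards, scores = np.array(rewards), np.array(scores)
    baseline = rewards.mean()              # variance-reduction baseline
    grad = np.mean((rewards - baseline) * scores)  # REINFORCE estimate
    c = max(1e-3, c + LR * grad / T)       # ascent step; /T is a crude scaling
print(f"learned tilt c = {c:.3f}")
```

The sketch relies on the score-function identity, ∇_c E[R] = E[R · ∇_c log p_c(pseudo-actions)]: because the pseudo-action is the sampled parameter vector, its log-density under the tilted Beta posterior is differentiable in `c` even though the arm choice itself is a non-differentiable argmax. The paper's specialized Bayesian-bandit estimators go beyond this generic construction.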
