Paper Title
Thompson Sampling for Linearly Constrained Bandits
Paper Authors
Paper Abstract
We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O(\sqrt{T}) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(\log T) for the suboptimal arms. We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.
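The abstract describes LinConTS only at a high level, so the following is a minimal illustrative sketch of one constrained Thompson Sampling round, not the authors' reference implementation. It assumes Bernoulli bandits with known per-arm rewards r, an assumed constraint threshold tau on the per-round probability of earning a reward, and Beta priors on the unknown reward probabilities; the per-round LP is written directly from the abstract's description of a linear constraint.

```python
# Sketch of constrained Thompson Sampling under assumed problem data.
# Not the paper's exact LinConTS algorithm; names (r, tau, K) are illustrative.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
K, tau = 5, 0.4                        # number of arms, constraint threshold (assumed)
r = rng.uniform(0.1, 1.0, size=K)      # known per-arm rewards (assumed setup)
true_theta = rng.uniform(0.2, 0.9, K)  # unknown reward probabilities (simulated)
alpha, beta = np.ones(K), np.ones(K)   # Beta(1, 1) priors on each theta_i

T = 1000
for t in range(T):
    # 1) Posterior sampling: draw a plausible theta for every arm.
    theta = rng.beta(alpha, beta)

    # 2) Solve the per-round LP over arm-selection probabilities x:
    #    maximize sum_i x_i * r_i * theta_i
    #    subject to sum_i x_i * theta_i >= tau, sum_i x_i = 1, x >= 0.
    res = linprog(c=-(r * theta),
                  A_ub=[-theta], b_ub=[-tau],
                  A_eq=[np.ones(K)], b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * K, method="highs")
    if res.success:
        x = np.clip(res.x, 0.0, None)
        x /= x.sum()
    else:
        # Sampled problem infeasible: fall back to the arm most likely to pay out.
        x = np.zeros(K)
        x[np.argmax(theta)] = 1.0

    # 3) Play an arm drawn from x, observe a Bernoulli outcome, update its posterior.
    i = rng.choice(K, p=x)
    reward_earned = rng.random() < true_theta[i]
    alpha[i] += reward_earned
    beta[i] += 1 - reward_earned
```

The sketch highlights the structure the abstract alludes to: posterior sampling replaces the unknown reward probabilities in both the objective and the constraint, and the resulting linear program (analyzed through its dual in the paper) determines a randomized arm choice each round.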