Paper Title
Thompson Sampling for Linearly Constrained Bandits
Paper Authors
Paper Abstract
We address multi-armed bandits (MAB) where the objective is to maximize the cumulative reward under a probabilistic linear constraint. For a few real-world instances of this problem, constrained extensions of the well-known Thompson Sampling (TS) heuristic have recently been proposed. However, finite-time analysis of constrained TS is challenging; as a result, only O(\sqrt{T}) bounds on the cumulative reward loss (i.e., the regret) are available. In this paper, we describe LinConTS, a TS-based algorithm for bandits that place a linear constraint on the probability of earning a reward in every round. We show that for LinConTS, the regret as well as the cumulative constraint violations are upper bounded by O(\log T) for the suboptimal arms. We develop a proof technique that relies on careful analysis of the dual problem and combine it with recent theoretical work on unconstrained TS. Through numerical experiments on two real-world datasets, we demonstrate that LinConTS outperforms an asymptotically optimal upper confidence bound (UCB) scheme in terms of simultaneously minimizing the regret and the violation.
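The abstract describes LinConTS only at a high level, so the following is a minimal illustrative sketch of one constrained Thompson Sampling round, not the authors' reference implementation. It assumes Bernoulli bandits with known per-arm rewards r, an assumed constraint threshold tau on the per-round probability of earning a reward, and Beta priors on the unknown reward probabilities; the per-round LP is written directly from the abstract's description of a linear constraint.

```python
# Sketch of constrained Thompson Sampling under assumed problem data.
# Not the paper's exact LinConTS algorithm; names (r, tau, K) are illustrative.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
K, tau = 5, 0.4                        # number of arms, constraint threshold (assumed)
r = rng.uniform(0.1, 1.0, size=K)      # known per-arm rewards (assumed setup)
true_theta = rng.uniform(0.2, 0.9, K)  # unknown reward probabilities (simulated)
alpha, beta = np.ones(K), np.ones(K)   # Beta(1, 1) priors on each theta_i

T = 1000
for t in range(T):
    # 1) Posterior sampling: draw a plausible theta for every arm.
    theta = rng.beta(alpha, beta)

    # 2) Solve the per-round LP over arm-selection probabilities x:
    #    maximize sum_i x_i * r_i * theta_i
    #    subject to sum_i x_i * theta_i >= tau, sum_i x_i = 1, x >= 0.
    res = linprog(c=-(r * theta),
                  A_ub=[-theta], b_ub=[-tau],
                  A_eq=[np.ones(K)], b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * K, method="highs")
    if res.success:
        x = np.clip(res.x, 0.0, None)
        x /= x.sum()
    else:
        # Sampled problem infeasible: fall back to the arm most likely to pay out.
        x = np.zeros(K)
        x[np.argmax(theta)] = 1.0

    # 3) Play an arm drawn from x, observe a Bernoulli outcome, update its posterior.
    i = rng.choice(K, p=x)
    reward_earned = rng.random() < true_theta[i]
    alpha[i] += reward_earned
    beta[i] += 1 - reward_earned
```

The sketch highlights the structure the abstract alludes to: posterior sampling replaces the unknown reward probabilities in both the objective and the constraint, and the resulting linear program (analyzed through its dual in the paper) determines a randomized arm choice each round.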