Paper Title

On Kernelized Multi-Armed Bandits with Constraints

Paper Authors

Xingyu Zhou, Bo Ji

Paper Abstract

We study a stochastic bandit problem with a general unknown reward function and a general unknown constraint function. Both functions can be non-linear (even non-convex) and are assumed to lie in a reproducing kernel Hilbert space (RKHS) with a bounded norm. This kernelized bandit setup strictly generalizes standard multi-armed bandits and linear bandits. In contrast to safety-type hard constraints studied in prior works, we consider soft constraints that may be violated in any round as long as the cumulative violations are small, which is motivated by various practical applications. Our ultimate goal is to study how to utilize the nature of soft constraints to attain a finer complexity-regret-constraint trade-off in the kernelized bandit setting. To this end, leveraging primal-dual optimization, we propose a general framework for both algorithm design and performance analysis. This framework builds upon a novel sufficient condition, which not only is satisfied under general exploration strategies, including \emph{upper confidence bound} (UCB), \emph{Thompson sampling} (TS), and new ones based on \emph{random exploration}, but also enables a unified analysis for showing both sublinear regret and sublinear or even zero constraint violation. We demonstrate the superior performance of our proposed algorithms via numerical experiments based on both synthetic and real-world datasets. Along the way, we also make the first detailed comparison between two popular methods for analyzing constrained bandits and Markov decision processes (MDPs) by discussing the key difference and some subtleties in the analysis, which could be of independent interest to the communities.
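To make the abstract's primal-dual approach concrete, here is a minimal sketch of a kernelized bandit loop with UCB exploration and a dual update on the soft constraint. Everything in it is an assumption for illustration: the RBF kernel model, the toy reward f and constraint g, and the parameter values (beta, eta, lam) are placeholders, not the paper's actual algorithm, functions, or tuning.

```python
import numpy as np

# Minimal sketch: primal-dual kernelized bandit with UCB exploration.
# The reward f and constraint g (feasible iff g(x) <= 0) are toy stand-ins.

rng = np.random.default_rng(0)

def rbf(A, B, ls=0.2):
    """RBF kernel matrix between row-stacked point sets A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / (2 * ls ** 2))

def gp_posterior(X, y, Xs, lam=0.1):
    """GP / kernel-ridge posterior mean and std. deviation at test points Xs."""
    K = rbf(X, X) + lam * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    ks = rbf(Xs, X)
    mean = ks @ Kinv @ y
    var = np.clip(1.0 - np.einsum('ij,jk,ik->i', ks, Kinv, ks), 1e-12, None)
    return mean, np.sqrt(var)

# Hypothetical unknown reward and constraint, both "lying in an RKHS".
f = lambda x: np.sin(3 * x[:, 0])
g = lambda x: x[:, 0] - 0.7

arms = rng.uniform(0, 1, size=(50, 1))  # discretized action set
X, yf, yg = [], [], []
dual = 0.0                               # Lagrange multiplier for the soft constraint
beta, eta = 2.0, 0.5                     # confidence width and dual step size (assumed)

for t in range(100):
    if len(X) < 2:
        idx = rng.integers(len(arms))    # cold start: pick randomly
    else:
        Xa = np.array(X)
        mf, sf = gp_posterior(Xa, np.array(yf), arms)
        mg, sg = gp_posterior(Xa, np.array(yg), arms)
        # Optimistic Lagrangian: UCB on the reward, LCB on the constraint cost.
        idx = np.argmax((mf + beta * sf) - dual * (mg - beta * sg))
    x = arms[idx:idx + 1]
    X.append(x[0])
    yf.append(f(x)[0] + 0.05 * rng.standard_normal())  # noisy reward feedback
    yg.append(g(x)[0] + 0.05 * rng.standard_normal())  # noisy constraint feedback
    # Dual ascent on the observed violation keeps cumulative violation small.
    dual = max(0.0, dual + eta * yg[-1])

print(f"final dual variable: {dual:.3f}, last arm: {X[-1][0]:.3f}")
```

The dual variable grows when the played arms violate the constraint and shrinks back toward zero otherwise, so the acquisition rule is automatically pulled toward the feasible region; swapping the UCB scores for posterior samples would give a Thompson-sampling variant of the same loop, in the spirit of the general exploration strategies the abstract mentions.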
