Paper Title

Adaptive KL-UCB based Bandit Algorithms for Markovian and i.i.d. Settings

Paper Authors

Arghyadip Roy, Sanjay Shakkottai, R. Srikant

Paper Abstract

In the regret-based formulation of Multi-armed Bandit (MAB) problems, except in rare instances, much of the literature focuses on arms with i.i.d. rewards. In this paper, we consider the problem of obtaining regret guarantees for MAB problems in which the rewards of each arm form a Markov chain which may not belong to a single parameter exponential family. To achieve a logarithmic regret in such problems is not difficult: a variation of standard Kullback-Leibler Upper Confidence Bound (KL-UCB) does the job. However, the constants obtained from such an analysis are poor for the following reason: i.i.d. rewards are a special case of Markov rewards and it is difficult to design an algorithm that works well independent of whether the underlying model is truly Markovian or i.i.d. To overcome this issue, we introduce a novel algorithm that identifies whether the rewards from each arm are truly Markovian or i.i.d. using a total variation distance-based test. Our algorithm then switches from using a standard KL-UCB to a specialized version of KL-UCB when it determines that the arm reward is Markovian, thus resulting in low regrets for both i.i.d. and Markovian settings.
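
The abstract outlines a test-then-switch structure: run a total variation distance-based test on each arm's reward history and move from the standard KL-UCB index to a Markov-aware index once the test flags dependence. The Python sketch below illustrates only that structure, under simplifying assumptions: binary rewards, a placeholder TV statistic that compares the empirical conditional laws P(X_{k+1} | X_k = 0) and P(X_{k+1} | X_k = 1), and an enlarged exploration constant standing in for the paper's specialized Markovian KL-UCB index. The function names, thresholds, and constants (looks_markovian, tv_threshold, explore_const) are illustrative and are not taken from the paper.

```python
# Minimal sketch of a "test-then-switch" KL-UCB bandit for binary rewards.
# The TV test and the Markov-aware index below are placeholders, not the
# paper's exact statistic or index.
import math
import random


def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def klucb_index(p_hat, n, t, explore_const=3.0, tol=1e-6):
    """Standard KL-UCB index: the largest q with n * kl(p_hat, q) <= f(t)."""
    if n == 0:
        return float("inf")  # force at least one pull of every arm
    bound = (math.log(t) + explore_const * math.log(max(math.log(t), 1.0))) / n
    lo, hi = p_hat, 1.0
    while hi - lo > tol:  # bisection: kl(p_hat, q) is increasing in q >= p_hat
        mid = (lo + hi) / 2.0
        if bern_kl(p_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo


def looks_markovian(rewards, tv_threshold=0.15):
    """Placeholder total-variation test: compare the empirical conditional
    laws P(X_{k+1} | X_k = 0) and P(X_{k+1} | X_k = 1). For binary rewards
    their TV distance is |p_hat(1|0) - p_hat(1|1)|; a large value suggests
    genuine Markovian dependence rather than i.i.d. rewards."""
    counts = {0: [0, 0], 1: [0, 0]}  # counts[prev][next]
    for prev, nxt in zip(rewards, rewards[1:]):
        counts[prev][nxt] += 1
    cond = []
    for prev in (0, 1):
        n = sum(counts[prev])
        if n == 0:
            return False  # too little data: fall back to the i.i.d. index
        cond.append(counts[prev][1] / n)
    return abs(cond[0] - cond[1]) > tv_threshold


def choose_arm(histories, t):
    """Pick the arm with the largest upper confidence index, switching to a
    (placeholder) Markov-aware index when the TV test flags dependence."""
    best_arm, best_index = 0, -float("inf")
    for arm, rewards in enumerate(histories):
        n = len(rewards)
        p_hat = sum(rewards) / n if n else 0.0
        if n and looks_markovian(rewards):
            # Stand-in for the paper's specialized KL-UCB: simply enlarge
            # the exploration term to reflect dependence in the samples.
            index = klucb_index(p_hat, n, t, explore_const=6.0)
        else:
            index = klucb_index(p_hat, n, t)
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm


if __name__ == "__main__":
    # Toy run: arm 0 gives i.i.d. Bernoulli rewards, arm 1 follows a sticky
    # two-state Markov chain, so the test should eventually flag only arm 1.
    random.seed(0)
    histories = [[], []]

    def draw(arm, prev):
        if arm == 0:
            return 1 if random.random() < 0.5 else 0
        if prev is None:
            return 1 if random.random() < 0.5 else 0
        return prev if random.random() < 0.9 else 1 - prev

    for t in range(1, 2001):
        arm = choose_arm(histories, t)
        prev = histories[arm][-1] if histories[arm] else None
        histories[arm].append(draw(arm, prev))
```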
