论文标题
遗憾多个最好的手臂
On Regret with Multiple Best Arms
论文作者
论文摘要
我们研究了多臂匪徒中多个最佳/近乎最佳武器的遗憾最小化问题。当武器/动作的数量比时间范围内且对匪徒实例的结构没有假设时,我们考虑了这种情况。我们的目标是设计可以自动适应问题的未知硬度的算法,即最佳武器数量。我们的设置捕获了许多现代化算法的现代应用,其中动作空间是巨大的,有关基础实例/结构的信息不可用。我们首先提出了一种自适应算法,该算法对硬度水平不可知,从理论上讲是遗憾的。然后,我们证明了针对我们的问题设置的下限,这表明:(1)在所有硬度级别上都不能同时最佳算法; (2)我们的算法达到了帕累托最佳的速率函数。鉴于最佳臂的预期奖励的更多知识,我们提出了另一种自适应算法,该算法是最佳的最佳选择,均在所有硬度级别上,均超过各个硬度因素。实验结果证实了我们的理论保证,并显示了我们的算法比以前的最先进的优势。
We study a regret minimization problem with the existence of multiple best/near-optimal arms in the multi-armed bandit setting. We consider the case when the number of arms/actions is comparable or much larger than the time horizon, and make no assumptions about the structure of the bandit instance. Our goal is to design algorithms that can automatically adapt to the unknown hardness of the problem, i.e., the number of best arms. Our setting captures many modern applications of bandit algorithms where the action space is enormous and the information about the underlying instance/structure is unavailable. We first propose an adaptive algorithm that is agnostic to the hardness level and theoretically derive its regret bound. We then prove a lower bound for our problem setting, which indicates: (1) no algorithm can be minimax optimal simultaneously over all hardness levels; and (2) our algorithm achieves a rate function that is Pareto optimal. With additional knowledge of the expected reward of the best arm, we propose another adaptive algorithm that is minimax optimal, up to polylog factors, over all hardness levels. Experimental results confirm our theoretical guarantees and show advantages of our algorithms over the previous state-of-the-art.