Paper Title

The Typical Behavior of Bandit Algorithms

Authors

Lin Fan, Peter W. Glynn

Abstract


We establish strong laws of large numbers and central limit theorems for the regret of two of the most popular bandit algorithms: Thompson sampling and UCB. Our characterizations of the regret distribution here complement the characterizations of the tail of the regret distribution recently developed by Fan and Glynn (2021) (arXiv:2109.13595). The tail characterizations there are associated with atypical bandit behavior on trajectories where the optimal arm mean is under-estimated, leading to mis-identification of the optimal arm and large regret. In contrast, our SLLNs and CLTs here describe the typical behavior and fluctuations of regret on trajectories where the optimal arm mean is properly estimated. We find that Thompson sampling and UCB satisfy the same SLLN and CLT, with the asymptotics of both the SLLN and the (mean) centering sequence in the CLT matching the asymptotics of expected regret. Both the mean and variance in the CLT grow at $\log(T)$ rates with the time horizon $T$. Asymptotically as $T \to \infty$, the variability in the number of plays of each sub-optimal arm depends only on the rewards received for that arm, which indicates that each sub-optimal arm contributes independently to the overall CLT variance.
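As a schematic reading of the results described above (this display is not taken from the paper), the CLT can be understood as saying that, for instance-dependent constants $c_1, c_2 > 0$ whose exact form is derived in the paper,
$$\frac{R(T) - c_1 \log T}{\sqrt{c_2 \log T}} \;\Rightarrow\; N(0, 1) \qquad \text{as } T \to \infty,$$
where $R(T)$ denotes the regret at horizon $T$, so that both the centering (mean) and the scaling (variance) grow at $\log(T)$ rates.

The following minimal simulation sketch, written independently of the paper, sets up the two algorithms the abstract discusses (UCB and Thompson sampling with Beta priors) on a Bernoulli bandit and records the regret of repeated runs; replicating it many times gives an empirical view of the regret distribution whose typical fluctuations the SLLN and CLT describe. The arm means, horizon, number of replications, and the UCB exploration constant are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_ucb(means, T):
    """UCB1-style index policy on a Bernoulli bandit; returns cumulative (pseudo-)regret."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    best = max(means)
    regret = 0.0
    for t in range(T):
        if t < K:
            a = t  # play each arm once to initialize the indices
        else:
            index = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            a = int(np.argmax(index))
        r = rng.binomial(1, means[a])
        counts[a] += 1
        sums[a] += r
        regret += best - means[a]
    return regret

def run_thompson(means, T):
    """Thompson sampling with Beta(1,1) priors; returns cumulative (pseudo-)regret."""
    K = len(means)
    alpha = np.ones(K)
    beta = np.ones(K)
    best = max(means)
    regret = 0.0
    for _ in range(T):
        a = int(np.argmax(rng.beta(alpha, beta)))  # sample from each posterior, play the argmax
        r = rng.binomial(1, means[a])
        alpha[a] += r
        beta[a] += 1 - r
        regret += best - means[a]
    return regret

# Illustrative two-armed instance; all numbers here are arbitrary choices.
means, T, reps = [0.6, 0.5], 10_000, 200
ucb = [run_ucb(means, T) for _ in range(reps)]
ts = [run_thompson(means, T) for _ in range(reps)]
print(f"UCB      regret over {reps} runs: mean {np.mean(ucb):.1f}, variance {np.var(ucb):.1f}")
print(f"Thompson regret over {reps} runs: mean {np.mean(ts):.1f}, variance {np.var(ts):.1f}")
```

Rerunning this sketch for several horizons $T$ and checking that the sample mean and sample variance of the regret both scale roughly linearly in $\log(T)$ is one simple way to probe the behavior the theorems describe.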
