论文标题

在非平稳环境中的组合半伴侣

Combinatorial Semi-Bandit in the Non-Stationary Environment

论文作者

Chen, Wei, Wang, Liwei, Zhao, Haoyu, Zheng, Kai

论文摘要

在本文中,我们在切换情况下和动态情况下研究了非平稳组合半伴侣问题。在(a)(a)奖励函数是非线性的一般情况下,(b)手臂可能是概率触发的,(c)仅存在近似离线的Oracle \ cite {wang2017improving},我们的algorithm Achieves $ \ tilde { distribution-dependent regret in the switching case, and $\tilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the dynamic case, where $\mathcal S$ is the number of switchings and $\mathcal V$ is the sum of the total ``distribution changes''.在两种情况下的遗憾界限几乎都是最佳的,但是我们的算法需要提前知道参数$ \ MATHCAL S $或$ \ MATHCAL V $。 我们进一步表明,通过采用另一种技术,我们的算法不再需要知道参数$ \ MATHCAL S $或$ \ MATHCAL V $,但是遗憾的界限可能会变得次优。 在奖励函数是线性并且具有精确的Oracle的特殊情况下,我们设计了一种无参数的算法,该算法在切换情况下和动态案例中都在不知道参数的情况下几乎获得了最佳的遗憾。

In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only approximate offline oracle exists \cite{wang2017improving}, our algorithm achieves $\tilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$ distribution-dependent regret in the switching case, and $\tilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the dynamic case, where $\mathcal S$ is the number of switchings and $\mathcal V$ is the sum of the total ``distribution changes''. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the parameter $\mathcal S$ or $\mathcal V$ in advance. We further show that by employing another technique, our algorithm no longer needs to know the parameters $\mathcal S$ or $\mathcal V$ but the regret bounds could become suboptimal. In a special case where the reward function is linear and we have an exact oracle, we design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源