Title
Non-Stationary Delayed Bandits with Intermediate Observations
Authors
Abstract
Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics. While mitigating the effects of delays in learning is well-understood in stationary environments, the problem becomes much more challenging when the environment changes. In fact, if the timescale of the change is comparable to the delay, it is impossible to learn about the environment, since the available observations are already obsolete. However, the arising issues can be addressed if intermediate signals are available without delay, such that given those signals, the long-term behavior of the system is stationary. To model this situation, we introduce the problem of stochastic, non-stationary, delayed bandits with intermediate observations. We develop a computationally efficient algorithm based on UCRL, and prove sublinear regret guarantees for its performance. Experimental results demonstrate that our method is able to learn in non-stationary delayed environments where existing methods fail.
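For illustration, here is a minimal, hypothetical simulation sketch of the setting described in the abstract: each pull of an arm immediately yields an intermediate signal whose distribution drifts over time, while the delayed long-term reward depends only on that signal and is therefore stationary given it. The drift schedule, signal/reward mappings, and the uniform placeholder policy below are illustrative assumptions, not the paper's construction or its UCRL-based algorithm.

```python
import numpy as np

# Sketch of a non-stationary delayed bandit with intermediate observations.
# Assumptions (not from the paper): sinusoidal drift, 3 arms, 4 signals,
# Bernoulli long-term rewards, and a uniform-random placeholder policy.

rng = np.random.default_rng(0)
n_arms, n_signals, delay, horizon = 3, 4, 50, 500

def signal_probs(arm, t):
    # Arm -> intermediate-signal distribution; slowly drifting logits make
    # this part of the environment non-stationary.
    logits = np.sin(2 * np.pi * t / 200.0 + arm + np.arange(n_signals))
    p = np.exp(logits)
    return p / p.sum()

# Signal -> expected long-term reward; this mapping is stationary, so the
# long-term behavior is stationary *given* the intermediate observation.
reward_given_signal = np.linspace(0.1, 0.9, n_signals)

pending = []  # (arrival_time, arm, signal, reward) awaiting delayed feedback
for t in range(horizon):
    arm = rng.integers(n_arms)                          # placeholder policy
    s = rng.choice(n_signals, p=signal_probs(arm, t))   # observed immediately
    r = rng.binomial(1, reward_given_signal[s])         # realized later
    pending.append((t + delay, arm, s, r))              # reward delayed
    arrived = [item for item in pending if item[0] == t]  # feedback usable now
```

A learner in this sketch would estimate the drifting arm-to-signal distributions from the undelayed signals and the stationary signal-to-reward mapping from the delayed feedback, which is the structure the abstract's problem formulation exploits.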