Paper title
Gradient-free Online Learning in Games with Delayed Rewards
Authors
Abstract
Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium with probability $1$, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
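The priority-queue idea in the abstract can be illustrated with a minimal single-agent sketch. This is not the authors' algorithm as published. It assumes a one-dimensional action set $[0, 1]$, a one-point (payoff-based) gradient estimate, and randomly delayed rewards that are processed oldest-first once they arrive. The function names `project` and `run_delayed_bandit`, the step-size and query-radius schedules, the delay model, and the restriction of the base point to $[0.25, 0.75]$ are all illustrative assumptions.

```python
import heapq
import random


def project(x, lo, hi):
    """Euclidean projection of a scalar onto the interval [lo, hi]."""
    return min(hi, max(lo, x))


def run_delayed_bandit(payoff, rounds=2000, max_delay=5, seed=0):
    """Gradient-free ascent with delayed, possibly out-of-order rewards.

    Assumed setup (hypothetical, for illustration only): actions live in
    [0, 1]; the base point is kept in [0.25, 0.75] so that every perturbed
    action x + delta * z stays feasible.
    """
    rng = random.Random(seed)
    x = 0.5            # base point around which actions are sampled
    in_flight = []     # min-heap keyed by arrival round
    pool = []          # min-heap keyed by the round the action was played

    for t in range(1, rounds + 1):
        delta = 0.25 * t ** -0.25   # query radius (assumed schedule)
        gamma = 0.5 * t ** -0.75    # step size (assumed schedule)

        # Play a randomly perturbed action; only its payoff is observed.
        z = rng.choice((-1.0, 1.0))  # the unit sphere in one dimension
        reward = payoff(x + delta * z)

        # The reward becomes visible only after a random delay.
        arrival = t + rng.randint(0, max_delay)
        heapq.heappush(in_flight, (arrival, t, reward, z, delta))

        # Move every reward that has arrived by round t into the pool.
        while in_flight and in_flight[0][0] <= t:
            _, played_at, r, z_s, delta_s = heapq.heappop(in_flight)
            heapq.heappush(pool, (played_at, r, z_s, delta_s))

        # Use the oldest available reward; if none has arrived, stay put.
        if pool:
            _, r, z_s, delta_s = heapq.heappop(pool)
            grad_estimate = (r / delta_s) * z_s  # one-point estimate, d = 1
            x = project(x + gamma * grad_estimate, 0.25, 0.75)

    return x


if __name__ == "__main__":
    # Concave payoff with maximizer at 0.3 (an arbitrary test function).
    print(run_delayed_bandit(lambda a: -(a - 0.3) ** 2))
```

In this sketch at most one queued reward is consumed per round, so a backlog can build up when several rewards arrive together. Processing them oldest-first is the role the priority queue plays here.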