Gradient-free Online Learning in Games with Delayed Rewards

Paper title

Gradient-free Online Learning in Games with Delayed Rewards

Authors

Amélie Héliou, Panayotis Mertikopoulos, Zhengyuan Zhou

Abstract

Motivated by applications to online advertising and recommender systems, we consider a game-theoretic model with delayed rewards and asynchronous, payoff-based feedback. In contrast to previous work on delayed multi-armed bandits, we focus on multi-player games with continuous action spaces, and we examine the long-run behavior of strategic agents that follow a no-regret learning policy (but are otherwise oblivious to the game being played, the objectives of their opponents, etc.). To account for the lack of a consistent stream of information (for instance, rewards can arrive out of order, with an a priori unbounded delay, etc.), we introduce a gradient-free learning policy where payoff information is placed in a priority queue as it arrives. In this general context, we derive new bounds for the agents' regret; furthermore, under a standard diagonal concavity assumption, we show that the induced sequence of play converges to Nash equilibrium with probability $1$, even if the delay between choosing an action and receiving the corresponding reward is unbounded.
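The queue-based idea in the abstract can be illustrated with a rough sketch. The code below is not the paper's algorithm as stated; it is a minimal illustration under several assumptions: a box-shaped action space, a standard one-point (payoff-only) gradient estimate of the form (dim / delta) * payoff * direction, and user-supplied values for the exploration radius `delta` and the step size `step`. The class name `DelayedBanditLearner` and all of its methods are hypothetical. Rewards that arrive late or out of order are pushed onto a min-heap keyed by the round in which the action was played, and each update consumes the oldest pending reward.

```python
import heapq
import math
import random


class DelayedBanditLearner:
    """Hypothetical single-player state: a pivot in [lo, hi]^dim and a queue of late payoffs."""

    def __init__(self, dim, lo=-1.0, hi=1.0, seed=0):
        self.dim, self.lo, self.hi = dim, lo, hi
        self.pivot = [0.0] * dim
        self.rng = random.Random(seed)
        self.queue = []  # min-heap of (round_played, payoff, direction)

    def act(self, delta):
        """Perturb the pivot along a random unit direction; return (action, direction).

        Simplification: the perturbed action is not forced back into the box.
        """
        g = [self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        w = [v / norm for v in g]
        action = [x + delta * wi for x, wi in zip(self.pivot, w)]
        return action, w

    def receive(self, round_played, payoff, direction):
        """Store a payoff whenever it arrives, possibly late and out of order."""
        heapq.heappush(self.queue, (round_played, payoff, direction))

    def update(self, delta, step):
        """Use the oldest pending payoff for one ascent step; return its round, or None."""
        if not self.queue:
            return None  # nothing has arrived yet: keep the pivot unchanged
        round_played, payoff, w = heapq.heappop(self.queue)
        scale = self.dim / delta * payoff  # one-point gradient estimate magnitude
        self.pivot = [
            min(self.hi, max(self.lo, x + step * scale * wi))  # project onto the box
            for x, wi in zip(self.pivot, w)
        ]
        return round_played
```

In a simulation loop, each player would call `act` once per round, hand the action to the environment, call `receive` for every payoff delivered in that round, and then call `update`. The choice of decreasing schedules for `delta` and `step` is left open here, since the abstract does not specify them.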
