Paper Title
Smooth Fictitious Play in Stochastic Games with Perturbed Payoffs and Unknown Transitions
Paper Authors
Paper Abstract
Recent extensions to dynamic games of the well-known fictitious play learning procedure in static games were proved to converge globally to stationary Nash equilibria in two important classes of dynamic games (zero-sum and identical-interest discounted stochastic games). However, those decentralized algorithms require the players to know the model exactly (the transition probabilities and their payoffs at every stage). To overcome these strong assumptions, our paper introduces regularizations of the systems in (Leslie 2020; Baudin 2022) to construct a family of new decentralized learning algorithms which are model-free (players do not know the transitions, and their payoffs are perturbed at every stage). Our procedures can be seen as extensions to stochastic games of the classical smooth fictitious play learning procedures in static games (where the players' best responses are regularized thanks to a smooth, strictly concave perturbation of their payoff functions). We prove the convergence of our family of procedures to stationary regularized Nash equilibria in zero-sum and identical-interest discounted stochastic games. The proof uses the continuous-time smooth best-response dynamics counterparts and stochastic approximation methods. When there is only one player, our problem is an instance of reinforcement learning, and our procedures are proved to converge globally to the optimal stationary policy of the regularized MDP. In that sense, they can be seen as an alternative to the well-known Q-learning procedure.
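As an illustrative sketch of the regularized best response invoked above (assuming the standard entropy perturbation with weight \(\eta > 0\); the paper's family of regularizers may differ), the static-game smooth best response of player \(i\) to the opponents' strategy profile \(\pi_{-i}\) is
\[
\mathrm{br}_i^{\eta}(\pi_{-i}) \;=\; \arg\max_{\sigma_i \in \Delta(A_i)} \Big\{ u_i(\sigma_i, \pi_{-i}) \;+\; \eta\, H(\sigma_i) \Big\},
\qquad H(\sigma_i) \;=\; -\sum_{a \in A_i} \sigma_i(a)\,\log \sigma_i(a),
\]
whose strict concavity in \(\sigma_i\) yields the unique logit (softmax) maximizer
\[
\mathrm{br}_i^{\eta}(\pi_{-i})(a) \;=\; \frac{\exp\!\big(u_i(a, \pi_{-i})/\eta\big)}{\sum_{b \in A_i} \exp\!\big(u_i(b, \pi_{-i})/\eta\big)}.
\]
In the stochastic-game extension described in the abstract, such a regularized response would be applied state by state to (estimated) continuation payoffs; the exact construction is given in the paper.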