Paper Title
Model-free Reinforcement Learning for Stochastic Stackelberg Security Games
Paper Authors
Paper Abstract
In this paper, we consider a sequential stochastic Stackelberg game with two players, a leader and a follower. The follower has access to the state of the system while the leader does not. Assuming that the players act in their respective best interests, the follower's strategy is to play the best response to the leader's strategy. In such a scenario, the leader has the advantage of committing to a policy that maximizes its own returns, given the knowledge that the follower will play the best response to that policy. Thus, both players converge to a pair of policies that form the Stackelberg equilibrium of the game. Recently, [1] provided a sequential decomposition algorithm to compute the Stackelberg equilibrium for such games, which allows Markovian equilibrium policies to be computed in linear time rather than in doubly exponential time, as before. In this paper, we extend this idea to an MDP whose dynamics are not known to the players and propose an RL algorithm based on Expected SARSA that learns the Stackelberg equilibrium policy by simulating a model of the MDP. We use particle filters to estimate the belief update for a common agent, which computes the optimal policy based on the information common to both players. We present a security game example to illustrate the policy learned by our algorithm.
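To make the ingredients of the abstract concrete, below is a minimal, self-contained sketch (not the paper's implementation) of an Expected SARSA update performed by a common agent whose state is a particle-filter belief over the hidden system state. The toy transition kernel, reward table, belief discretization, and epsilon-greedy leader policy are all illustrative assumptions, and the follower's best response is omitted for brevity.

```python
import numpy as np

# Minimal sketch (not the paper's implementation): Expected SARSA updates for a
# "common agent" whose state is a particle-filter belief over the hidden system
# state. The toy kernels, discretization, and epsilon-greedy policy below are
# illustrative assumptions; the follower's best response is omitted.

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2                      # hidden states; leader actions
n_particles, alpha, gamma, eps = 100, 0.1, 0.95, 0.1

# Assumed transition and reward kernels, used only to simulate the MDP.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> next-state dist
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # leader's reward

Q = {}  # Q-values over (discretized belief, leader action)

def key(belief, bins=4):
    # coarse discretization of the belief simplex so Q can stay tabular
    return tuple(np.minimum((belief * bins).astype(int), bins - 1))

def qvals(k):
    return np.array([Q.get((k, a), 0.0) for a in range(n_actions)])

def eps_greedy(k):
    return int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(qvals(k)))

def expected_value(k):
    # Expected SARSA target: average of Q under the epsilon-greedy policy
    v = qvals(k)
    p = np.full(n_actions, eps / n_actions)
    p[int(np.argmax(v))] += 1.0 - eps
    return float(p @ v)

particles = rng.integers(n_states, size=n_particles)   # belief particles
true_state = int(rng.integers(n_states))                # hidden from the leader

for t in range(500):
    belief = np.bincount(particles, minlength=n_states) / n_particles
    k = key(belief)
    a = eps_greedy(k)

    # Simulate one step of the MDP (its dynamics are unknown to the leader);
    # only the reward is observed.
    reward = R[true_state, a]
    true_state = int(rng.choice(n_states, p=P[true_state, a]))

    # Particle-filter belief update: propagate every particle through the same kernel.
    particles = np.array([rng.choice(n_states, p=P[s, a]) for s in particles])

    k_next = key(np.bincount(particles, minlength=n_states) / n_particles)

    # Expected SARSA update on the common agent's Q-function.
    target = reward + gamma * expected_value(k_next)
    Q[(k, a)] = Q.get((k, a), 0.0) + alpha * (target - Q.get((k, a), 0.0))
```

The sketch mirrors two points from the abstract: the leader never observes the state directly and instead acts on a belief maintained from common information, and the bootstrap target averages Q-values under the current policy, as in Expected SARSA, rather than using a sampled next action.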