Title
Correcting Experience Replay for Multi-Agent Communication
Authors
Abstract
We consider the problem of learning to communicate using multi-agent reinforcement learning (MARL). A common approach is to learn off-policy, using data sampled from a replay buffer. However, messages received in the past may not accurately reflect the current communication policy of each agent, and this complicates learning. We therefore introduce a 'communication correction' which accounts for the non-stationarity of observed communication induced by multi-agent learning. It works by relabelling the received message to make it likely under the communicator's current policy, and thus be a better reflection of the receiver's current environment. To account for cases in which agents are both senders and receivers, we introduce an ordered relabelling scheme. Our correction is computationally efficient and can be integrated with a range of off-policy algorithms. We find in our experiments that it substantially improves the ability of communicating MARL systems to learn across a variety of cooperative and competitive tasks.
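The relabelling idea in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the data structures, the `make_policy` stand-in, and the function names are all hypothetical, and real message policies would be learned networks conditioned on richer inputs. It shows the core operation of replacing stale stored messages with messages regenerated by each sender's current policy, processing agents in a fixed order so that agents who are both senders and receivers are handled consistently (the "ordered relabelling scheme").

```python
import numpy as np

# Hypothetical stand-in for an agent's current (learned) message policy.
def make_policy(weight):
    return lambda obs: np.tanh(weight * obs)

def communication_correction(batch, message_policies, order):
    """Relabel stored messages using each sender's *current* policy.

    batch: list of transitions; each holds per-agent observations
           ("obs") and the messages stored at collection time ("msg").
    message_policies: dict mapping agent id -> current message policy.
    order: fixed agent ordering, so relabelling is applied in a
           consistent sequence when agents both send and receive.
    """
    for transition in batch:
        for sender in order:
            # Recompute the sender's message from its stored observation,
            # overwriting the stale message sampled from the replay buffer.
            transition["msg"][sender] = message_policies[sender](
                transition["obs"][sender]
            )
    return batch
```

The relabelled batch can then be fed to any off-policy learner in place of the raw sampled batch, which is what makes the correction easy to combine with existing algorithms.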