Paper Title
Multi-Agent Reinforcement Learning for Unmanned Aerial Vehicle Coordination by Multi-Critic Policy Gradient Optimization
Paper Authors
Paper Abstract
Recent technological progress in the development of Unmanned Aerial Vehicles (UAVs), together with decreasing acquisition costs, makes the application of drone fleets attractive for a wide variety of tasks. In agriculture, disaster management, search and rescue operations, and commercial and military applications, the advantage of deploying a fleet of drones stems from its ability to cooperate autonomously. Multi-Agent Reinforcement Learning approaches that optimize a neural-network-based control policy, such as the best-performing actor-critic policy gradient algorithms, struggle to back-propagate errors from distinct reward signal sources effectively and tend to favor lucrative signals while neglecting coordination and the exploitation of previously learned similarities. We propose a Multi-Critic Policy Optimization architecture with multiple value-estimating networks and a novel advantage function that optimizes a stochastic actor policy network to achieve optimal coordination of agents. We apply the algorithm to several tasks that require the collaboration of multiple drones in a physics-based reinforcement learning environment. Our approach achieves stable policy network updates and similar reward signal development for an increasing number of agents. The resulting policy achieves optimal coordination and complies with constraints such as collision avoidance.
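As a rough illustration of the multi-critic idea described in the abstract, the sketch below pairs a single stochastic actor with one value head per reward source and combines per-source advantages into one policy gradient signal. This is a minimal sketch, not the authors' implementation: the class and function names, network sizes, and the normalize-then-sum combination rule are all assumptions, since the abstract does not specify the paper's exact advantage function.

```python
import torch
import torch.nn as nn


class MultiCriticActor(nn.Module):
    """Stochastic Gaussian actor with one value head per reward source.

    Hypothetical sketch of a multi-critic actor-critic layout: instead of a
    single scalar critic, each reward signal source (e.g., goal progress vs.
    collision avoidance) gets its own independent value network.
    """

    def __init__(self, obs_dim: int, act_dim: int, n_reward_sources: int, hidden: int = 64):
        super().__init__()
        # Shared stochastic actor: outputs the mean of a Gaussian action
        # distribution; the (state-independent) log-std is a learned parameter.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        # One value-estimating network per reward signal source.
        self.critics = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(n_reward_sources)
        )

    def forward(self, obs: torch.Tensor):
        mean = self.actor(obs)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        # Stack per-source value estimates into shape (batch, n_sources).
        values = torch.cat([critic(obs) for critic in self.critics], dim=-1)
        return dist, values


def multi_critic_advantage(returns: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """Combine per-source advantages into one scalar per time step.

    `returns` and `values` both have shape (batch, n_sources). Normalizing
    each source separately before summing keeps a single lucrative reward
    stream from dominating the policy gradient; this combination rule is an
    assumed stand-in for the paper's advantage function.
    """
    adv = returns - values.detach()
    adv = (adv - adv.mean(dim=0)) / (adv.std(dim=0) + 1e-8)
    return adv.sum(dim=-1)
```

With this setup, a policy update could weight each action's log-probability by the combined advantage, e.g. `loss = -(dist.log_prob(action).sum(-1) * multi_critic_advantage(returns, values)).mean()`, while each critic head is regressed onto its own reward stream's returns. The per-source normalization is the design choice that addresses the favor-the-lucrative-signal failure mode the abstract describes.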