Paper Title
Reward Conditioned Neural Movement Primitives for Population Based Variational Policy Optimization
Paper Authors
Paper Abstract
The aim of this paper is to study the reward-based policy exploration problem in a supervised learning setting and to enable robots to form complex movement trajectories in challenging reward landscapes and search spaces. To this end, the experience of the robot, which can be bootstrapped from demonstrated trajectories, is used to train a novel Neural Processes-based deep network that samples from its latent space and generates the required trajectories given desired rewards. Our framework produces progressively improved trajectories by sampling them from high-reward regions of the landscape, gradually increasing the reward. Variational inference is used to create a stochastic latent space from which varying trajectories are sampled when generating populations of trajectories for target rewards. We benefit from Evolutionary Strategies and propose a novel crossover operation, applied in the self-organized latent space of the individual policies, which allows blending of individuals that may address different factors of the reward function. On a number of tasks that require sequentially reaching multiple points or passing through gaps between objects, we show that our method provides stable learning progress and significant sample efficiency compared to several state-of-the-art robotic reinforcement learning methods. Finally, we demonstrate the real-world applicability of our method through a real-robot execution involving obstacle avoidance.
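The abstract describes two mechanisms: sampling trajectories from a reward-conditioned stochastic latent space, and a crossover operation applied to the latent representations of individual policies. The following is a minimal, hypothetical sketch of how such reward-conditioned latent sampling and latent-space crossover could be structured in PyTorch; the class name, network sizes, aggregation scheme, and convex-combination crossover are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of reward-conditioned latent sampling and latent-space
# crossover, loosely following the abstract. Architecture sizes, class names,
# and the convex-combination crossover are illustrative assumptions.
import torch
import torch.nn as nn

class RewardConditionedCNMP(nn.Module):
    def __init__(self, traj_dim=2, reward_dim=1, latent_dim=8, hidden=128):
        super().__init__()
        # Encoder: maps (time, trajectory point, reward) observations to the
        # parameters of a stochastic latent distribution (variational inference).
        self.encoder = nn.Sequential(
            nn.Linear(1 + traj_dim + reward_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder: maps (latent sample, target reward, query time) to a
        # trajectory point, so trajectories can be generated for a desired reward.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + reward_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, traj_dim),
        )

    def encode(self, context):  # context: (n_obs, 1 + traj_dim + reward_dim)
        stats = self.encoder(context).mean(dim=0)   # aggregate observations
        mean, logvar = stats.chunk(2, dim=-1)
        return mean, logvar

    def sample_latent(self, mean, logvar):
        # Reparameterization trick: a stochastic latent space from which
        # varying individuals of a trajectory population can be sampled.
        return mean + torch.randn_like(mean) * torch.exp(0.5 * logvar)

    def decode(self, z, target_reward, times):
        # Generate one trajectory point per query time step.
        n = times.shape[0]
        inp = torch.cat(
            [z.expand(n, -1), target_reward.expand(n, -1), times], dim=-1)
        return self.decoder(inp)


def latent_crossover(z_a, z_b, alpha=None):
    """Blend two individuals in latent space (one possible crossover form)."""
    if alpha is None:
        alpha = torch.rand(z_a.shape[-1])  # per-dimension mixing coefficients
    return alpha * z_a + (1.0 - alpha) * z_b


# Usage sketch: condition on observed (time, point, reward) tuples, sample two
# latent individuals, cross them over, and decode a trajectory for a target reward.
model = RewardConditionedCNMP()
context = torch.randn(5, 1 + 2 + 1)             # 5 observed tuples (toy data)
mean, logvar = model.encode(context)
z_a = model.sample_latent(mean, logvar)
z_b = model.sample_latent(mean, logvar)
z_child = latent_crossover(z_a, z_b)
times = torch.linspace(0, 1, 50).unsqueeze(-1)  # query time steps
trajectory = model.decode(z_child, torch.tensor([[1.0]]), times)
```

In this reading, requesting a higher target reward at decode time steers generation toward trajectories associated with high-reward regions, while crossover in the latent space blends individuals that may each satisfy different factors of the reward function.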