Paper Title

Adversarial Imitation Learning via Random Search

Paper Authors

MyungJae Shin, Joongheon Kim

Paper Abstract

Developing agents that can perform challenging, complex tasks is the goal of reinforcement learning. Model-free reinforcement learning has been considered a feasible solution, but state-of-the-art research has relied on increasingly complicated techniques. This growing complexity makes reproduction difficult, and the problem of reward dependency still remains. As a result, imitation learning, which learns a policy from expert demonstrations, has begun to attract attention. Imitation learning learns a policy directly from data on expert behavior, without an explicit reward signal provided by the environment. However, existing imitation learning methods optimize policies with deep reinforcement learning algorithms such as trust region policy optimization, so imitation learning based on deep reinforcement learning inherits the same reproducibility crisis. The complexity of model-free methods has received considerable critical attention: derivative-free optimization combined with simplified policies achieves competitive performance on dynamically complex tasks. Simplified policies and derivative-free methods keep the algorithm simple and make research demonstrations easy to reconfigure. In this paper, we propose an imitation learning method that takes advantage of derivative-free optimization with simple linear policies. The proposed method performs simple random search in the parameter space of the policies and is computationally efficient. Experiments in this paper show that the proposed model, without a direct reward signal from the environment, obtains competitive performance on the MuJoCo locomotion tasks.
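To make the idea concrete, below is a minimal sketch (not the authors' implementation) of adversarial imitation learning driven by basic random search over a linear policy. It assumes a classic Gym-style environment `env` with `reset`/`step`, expert state-action arrays `expert_s`/`expert_a`, and a simple linear-logistic discriminator; the function names and hyperparameters (`n_dirs`, `noise`, `step`, `lr`) are illustrative choices, not values from the paper.

```python
import numpy as np

def disc_logits(w, s, a):
    # Linear discriminator score for a state-action pair (or batch of pairs).
    x = np.concatenate([s, a], axis=-1)
    return x @ w

def rollout(env, M, w, horizon=1000):
    # Run the linear policy a = M @ s; the training "reward" comes from the
    # discriminator, not from the environment (the env reward is ignored).
    s = env.reset()
    total, pairs = 0.0, []
    for _ in range(horizon):
        a = M @ s
        p = 1.0 / (1.0 + np.exp(-disc_logits(w, s, a)))   # D(s, a)
        total += -np.log(1.0 - p + 1e-8)                   # surrogate imitation reward
        pairs.append((s, a))
        s, _, done, _ = env.step(a)
        if done:
            break
    return total, pairs

def train(env, expert_s, expert_a, n_iters=100, n_dirs=8, step=0.02, noise=0.03, lr=1e-3):
    obs_dim, act_dim = expert_s.shape[1], expert_a.shape[1]
    M = np.zeros((act_dim, obs_dim))        # linear policy parameters
    w = np.zeros(obs_dim + act_dim)         # discriminator parameters
    for _ in range(n_iters):
        deltas, scores, pol_pairs = [], [], []
        for _ in range(n_dirs):
            d = np.random.randn(act_dim, obs_dim)
            r_plus, pairs_p = rollout(env, M + noise * d, w)
            r_minus, pairs_m = rollout(env, M - noise * d, w)
            deltas.append(d)
            scores.append((r_plus, r_minus))
            pol_pairs += pairs_p + pairs_m
        # Random-search policy update: step along the sampled directions,
        # weighted by the difference in discriminator-based returns.
        grad = sum((rp - rm) * d for (rp, rm), d in zip(scores, deltas)) / n_dirs
        sigma = np.std([r for pair in scores for r in pair]) + 1e-8
        M += step * grad / sigma
        # Discriminator update: logistic regression pushing expert pairs
        # toward label 1 and policy pairs toward label 0.
        pol_s = np.array([s for s, _ in pol_pairs])
        pol_a = np.array([a for _, a in pol_pairs])
        for xs, xa, label in [(expert_s, expert_a, 1.0), (pol_s, pol_a, 0.0)]:
            x = np.concatenate([xs, xa], axis=1)
            p = 1.0 / (1.0 + np.exp(-(x @ w)))
            w += lr * x.T @ (label - p) / len(x)
    return M
```

For example, one might call `train(gym.make("Hopper-v2"), expert_s, expert_a)` with demonstrations collected from a trained expert; consistent with the abstract, the environment reward would be used only to evaluate the learned policy, never to train it.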
