Paper Title

Dynamic Value Estimation for Single-Task Multi-Scene Reinforcement Learning

Authors

Jaskirat Singh, Liang Zheng

Abstract

Training deep reinforcement learning agents on environments with multiple levels/scenes/conditions from the same task has become essential for many applications aiming to achieve generalization and domain transfer from simulation to the real world. While such a strategy is helpful for generalization, the use of multiple scenes significantly increases the variance of samples collected for policy gradient computations. Current methods continue to view this collection of scenes as a single Markov Decision Process (MDP) with a common value function; however, we argue that it is better to treat the collection as a single environment with multiple underlying MDPs. To this end, we propose a dynamic value estimation (DVE) technique for these multiple-MDP environments, motivated by the clustering effect observed in the value function distribution across different scenes. The resulting agent is able to learn a more accurate, scene-specific value function estimate (and hence advantage function), leading to lower sample variance. Our proposed approach is simple to incorporate into several existing implementations (like PPO and A3C) and yields consistent improvements across a range of ProcGen environments and the AI2-THOR framework-based visual navigation task.
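The abstract's core claim, that a scene-specific value baseline yields lower-variance advantage estimates than a single shared baseline when returns cluster by scene, can be illustrated numerically. The sketch below is not the authors' implementation: the toy returns, the scene_means array, and the use of ground-truth scene labels are all hypothetical, chosen only to mimic the clustering effect the abstract describes.

```python
# Minimal sketch (assumed setup, not the paper's code): compare advantage
# variance under a single common value baseline vs. scene-specific baselines
# when per-scene returns cluster around different means.
import numpy as np

rng = np.random.default_rng(0)

# Toy multi-scene environment: each scene has its own mean return
# (the "clustering effect" in the value distribution across scenes).
scene_means = np.array([1.0, 5.0, 9.0])          # hypothetical cluster centers
scene_ids = rng.integers(0, len(scene_means), size=3000)
returns = scene_means[scene_ids] + rng.normal(0.0, 0.5, size=scene_ids.size)

# Single-MDP view: one common value baseline shared by all scenes.
common_baseline = returns.mean()
adv_common = returns - common_baseline

# Multiple-MDP view: a scene-specific baseline per underlying MDP.
# (Here scene identity is given; the paper's DVE instead adapts the
# estimate dynamically from the observed value distribution.)
scene_baselines = np.array(
    [returns[scene_ids == s].mean() for s in range(len(scene_means))]
)
adv_scene = returns - scene_baselines[scene_ids]

print(f"advantage variance, common baseline:   {adv_common.var():.3f}")
print(f"advantage variance, scene-specific:    {adv_scene.var():.3f}")
```

Running this, the common baseline leaves the inter-scene spread inside the advantages, while the scene-specific baselines remove it, shrinking the variance roughly to the within-scene noise. Note the sketch assumes oracle scene labels for clarity; per the abstract, DVE does not require them and instead exploits the clustering observed in the value function distribution during training.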
