Unbiased Deep Reinforcement Learning: A General Training Framework for Existing and Future Algorithms

Title

Unbiased Deep Reinforcement Learning: A General Training Framework for Existing and Future Algorithms

Authors

Zhang, Huihui; Huang, Wu

Abstract

In recent years, deep neural networks have been successfully applied to reinforcement learning \cite{bengio2009learning,krizhevsky2012imagenet,hinton2006reducing}. Deep reinforcement learning \cite{mnih2015human} is reported to have an advantage over traditional agents in learning effective policies directly from high-dimensional sensory inputs. However, the literature so far shows no fundamental change or improvement to the existing training framework. Here we propose a novel training framework that is conceptually easy to understand and potentially easy to generalize to all feasible reinforcement learning algorithms. Instead of experience replay, we use Monte Carlo sampling to obtain raw data inputs, train on them in batches to obtain Markov decision process sequences, and update the network parameters synchronously. This framework is shown to optimize an unbiased approximation of the loss function, whose estimate exactly matches the true probability distribution that the data inputs follow. Evaluated on both discrete action spaces and continuous control problems, it therefore shows overwhelming advantages in sample efficiency and convergence rate over existing deep reinforcement learning. In addition, we propose several algorithms embedded in the new framework to handle typical discrete and continuous scenarios. These algorithms prove far more efficient than their original versions under the existing deep reinforcement learning framework, and they serve as examples of how existing and future algorithms can be generalized to the new framework.
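
The abstract does not give the algorithm itself, so the following is only a minimal illustrative sketch of the general idea it describes: draw a fresh batch of episodes from the current policy by Monte Carlo sampling, apply one synchronous update computed from that batch, then discard the batch instead of storing it in a replay buffer. Everything in it is an assumption made for illustration and is not taken from the paper: the four-state chain task, the tabular `q` standing in for network parameters, the every-visit Monte Carlo return target, and the names `step`, `act`, `sample_episode`, and `train`.

```python
import random

N_STATES = 4          # chain 0..3; state 3 is terminal (hypothetical toy task)
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA = 0.9           # discount factor (assumed value)


def step(state, action):
    """Deterministic chain dynamics: reward 1 only on reaching the last state."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def act(q, state, eps, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def sample_episode(q, eps, rng, max_steps=20):
    """Roll out one episode with the current policy.

    Returns (state, action, discounted return) triples, one per step taken.
    """
    state, traj = 0, []
    for _ in range(max_steps):
        action = act(q, state, eps, rng)
        nxt, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = nxt
        if done:
            break
    g, out = 0.0, []
    for s, a, r in reversed(traj):   # accumulate returns backwards in time
        g = r + GAMMA * g
        out.append((s, a, g))
    return out


def train(iterations=200, batch_episodes=16, lr=0.5, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(iterations):
        # 1) Sample a fresh batch from the *current* policy; q is not
        #    modified while the batch is being collected.
        sums, counts = {}, {}
        for _ in range(batch_episodes):
            for s, a, g in sample_episode(q, eps, rng):
                sums[(s, a)] = sums.get((s, a), 0.0) + g
                counts[(s, a)] = counts.get((s, a), 0) + 1
        # 2) One synchronous update: move every visited entry toward the
        #    batch-mean return. The batch is then discarded (no replay buffer).
        for (s, a), total in sums.items():
            q[s][a] += lr * (total / counts[(s, a)] - q[s][a])
    return q


q_table = train()
```

Because each update uses only samples drawn from the policy being updated, the batch average is a sample mean over the distribution the data currently follow. A replay buffer, by contrast, mixes in transitions generated by older policies. Whether this matches the paper's exact construction cannot be confirmed from the abstract alone.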
