Paper Title
Towards Applicable Reinforcement Learning: Improving the Generalization and Sample Efficiency with Policy Ensemble
Paper Authors
Paper Abstract
It is challenging for reinforcement learning (RL) algorithms to succeed in real-world applications such as financial trading and logistics systems, due to noisy observations and environment shifts between training and evaluation. Resolving real-world tasks therefore requires both high sample efficiency and strong generalization. However, directly applying typical RL algorithms leads to poor performance in such scenarios. Motivated by the strong accuracy and generalization of ensemble methods in supervised learning (SL), we design a robust and applicable method named Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner. Notably, EPPO organically combines each sub-policy with the policy ensemble and optimizes both simultaneously. In addition, EPPO adopts a diversity enhancement regularization over the policy space, which helps generalize to unseen states and promotes exploration. We theoretically prove that EPPO increases exploration efficacy, and through comprehensive experimental evaluations on various tasks, we demonstrate that EPPO achieves higher sample efficiency and is more robust for real-world applications than vanilla policy optimization algorithms and other ensemble methods. Code and supplemental materials are available at https://seqml.github.io/eppo.
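To make the idea in the abstract concrete, below is a minimal sketch of what a jointly optimized policy ensemble with a diversity regularizer might look like. It is not the authors' implementation: the names `PolicyEnsemble` and `eppo_style_loss` are hypothetical, and it assumes the ensemble policy is the average of the sub-policies' action distributions, that each sub-policy and the ensemble share a clipped PPO surrogate, and that diversity is encouraged through a mean pairwise KL term. The actual EPPO formulation may differ; see the paper and the released code at https://seqml.github.io/eppo for the exact method.

```python
# Sketch of an ensemble policy objective in the spirit of EPPO (assumptions noted above).
import torch
import torch.nn as nn


class PolicyEnsemble(nn.Module):
    """K independent policy heads over a shared observation (hypothetical structure)."""

    def __init__(self, obs_dim: int, n_actions: int, n_policies: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
            for _ in range(n_policies)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns per-sub-policy action probabilities, shape (K, batch, n_actions).
        return torch.stack([torch.softmax(h(obs), dim=-1) for h in self.heads])


def eppo_style_loss(probs, old_probs, actions, advantages, clip_eps=0.2, div_coef=0.01):
    """Clipped surrogate over each sub-policy AND the averaged ensemble policy,
    minus a pairwise-KL diversity bonus (an assumed, simplified form)."""
    ensemble = probs.mean(dim=0)                      # ensemble = average of sub-policies
    old_ensemble = old_probs.mean(dim=0)
    all_probs = torch.cat([probs, ensemble.unsqueeze(0)])      # (K+1, batch, n_actions)
    all_old = torch.cat([old_probs, old_ensemble.unsqueeze(0)])

    # Probability of the taken action under each (sub-)policy, old and new.
    idx = actions.expand(all_probs.size(0), -1).unsqueeze(-1)
    pi_a = all_probs.gather(-1, idx).squeeze(-1)
    old_pi_a = all_old.gather(-1, idx).squeeze(-1)

    # Standard PPO clipped surrogate, averaged over the K sub-policies and the ensemble.
    ratio = pi_a / old_pi_a.clamp_min(1e-8)
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages)
    policy_loss = -surrogate.mean()

    # Diversity regularization: maximize mean pairwise KL between sub-policies.
    log_probs = probs.clamp_min(1e-8).log()
    kl = (probs.unsqueeze(0) * (log_probs.unsqueeze(0) - log_probs.unsqueeze(1))).sum(-1)
    diversity = kl.mean()

    return policy_loss - div_coef * diversity
```

In this sketch, optimizing the sub-policies and the averaged ensemble in one loss lets them share gradients, while the KL term pushes the sub-policies apart, which is one plausible way to realize the diversity-driven exploration the abstract describes.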