Paper Title
Bidirectional Model-based Policy Optimization
Paper Authors
Abstract
Model-based reinforcement learning approaches leverage a forward dynamics model to support planning and decision making, which, however, may fail catastrophically if the model is inaccurate. Although several existing methods are dedicated to combating model error, the potential of a single forward model is still limited. In this paper, we propose to additionally construct a backward dynamics model to reduce the reliance on the accuracy of forward model predictions. We develop a novel method, called Bidirectional Model-based Policy Optimization (BMPO), that utilizes both the forward model and the backward model to generate short branched rollouts for policy optimization. Furthermore, we theoretically derive a tighter bound on the return discrepancy, which shows the superiority of BMPO over its counterpart that uses only the forward model. Extensive experiments demonstrate that BMPO outperforms state-of-the-art model-based methods in terms of sample efficiency and asymptotic performance.
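To make the rollout scheme described above concrete, the following is a minimal Python sketch of how short branched rollouts might be generated in both directions from a state sampled out of the real-environment replay buffer. The names forward_model, backward_model, policy, and backward_policy are hypothetical placeholders assumed for illustration, not the authors' actual interfaces.

    # Hypothetical sketch of bidirectional branched rollouts (not the authors' code).
    # Assumed interfaces:
    #   forward_model(s, a)  -> (next_state, reward)   learned forward dynamics
    #   backward_model(s, a) -> (prev_state, reward)   learned backward dynamics
    #   policy(s)            -> action                 current policy
    #   backward_policy(s)   -> action                 policy over preceding actions

    def bidirectional_rollout(start_state, policy, backward_policy,
                              forward_model, backward_model, horizon=3):
        """Generate a short branched rollout of length `horizon` in each direction,
        starting from a state sampled from the real-environment replay buffer."""
        transitions = []

        # Forward branch: roll the current policy through the forward model.
        s = start_state
        for _ in range(horizon):
            a = policy(s)
            s_next, r = forward_model(s, a)        # predicted next state and reward
            transitions.append((s, a, r, s_next))
            s = s_next

        # Backward branch: roll a backward policy through the backward model,
        # predicting which state-action pairs could have led to the start state.
        s = start_state
        for _ in range(horizon):
            a_prev = backward_policy(s)
            s_prev, r = backward_model(s, a_prev)  # predicted preceding state and reward
            transitions.append((s_prev, a_prev, r, s))
            s = s_prev

        # The collected model-generated transitions would then be added to a model
        # buffer and used for policy optimization, e.g. with an off-policy learner.
        return transitions

Keeping both branches short limits the compounding of model error, while the backward branch lets the agent reach a given state through imagined predecessors instead of relying solely on forward predictions.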