Paper Title

Offline Reinforcement Learning with Closed-Form Policy Improvement Operators

Paper Authors

Jiachen Li, Edwin Zhang, Ming Yin, Qinxun Bai, Yu-Xiang Wang, William Yang Wang

Abstract

Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning. By exploiting historical transitions, a policy is trained to maximize a learned value function while constrained by the behavior policy to avoid a significant distributional shift. In this paper, we propose our closed-form policy improvement operators. We make a novel observation that the behavior constraint naturally motivates the use of first-order Taylor approximation, leading to a linear approximation of the policy objective. Additionally, as practical datasets are usually collected by heterogeneous policies, we model the behavior policies as a Gaussian Mixture and overcome the induced optimization difficulties by leveraging the LogSumExp's lower bound and Jensen's Inequality, giving rise to a closed-form policy improvement operator. We instantiate offline RL algorithms with our novel policy improvement operators and empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark. Our code is available at https://cfpi-icml23.github.io/.
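For concreteness, the sketch below illustrates the single-Gaussian case of the idea described in the abstract: the learned Q-function is linearized around the behavior-policy mean (first-order Taylor approximation), and the resulting linear objective is maximized in closed form over a Mahalanobis-ball trust region. The names (`q_net`, `mu`, `sigma_diag`, `tau`), the diagonal-covariance assumption, and the exact trust-region parameterization are illustrative choices rather than the paper's precise formulation; the Gaussian-mixture case additionally relies on the LogSumExp lower bound and Jensen's Inequality mentioned above.

```python
# Minimal sketch of a single-Gaussian closed-form policy improvement step.
# Assumptions (not the paper's exact API): the behavior policy at `state` is
# N(mu, diag(sigma_diag)), `q_net(state, action)` is a learned critic, and the
# behavior constraint is modeled as (a - mu)^T Sigma^{-1} (a - mu) <= tau.
import torch


def cfpi_single_gaussian(q_net, state, mu, sigma_diag, tau):
    """Return an improved action from a linearized Q-function in closed form.

    1. Linearize Q(s, a) around a = mu:  Q(s, a) ~ Q(s, mu) + g^T (a - mu),
       with g = grad_a Q(s, a) evaluated at a = mu.
    2. Maximize the linear objective over the Mahalanobis ball
       (a - mu)^T Sigma^{-1} (a - mu) <= tau, whose maximizer is
       a* = mu + sqrt(tau / (g^T Sigma g)) * Sigma g.
    """
    mu = mu.detach().requires_grad_(True)
    q_value = q_net(state, mu)
    # Gradient of Q w.r.t. the action, evaluated at the behavior-policy mean.
    (grad,) = torch.autograd.grad(q_value.sum(), mu)
    sigma_grad = sigma_diag * grad                      # Sigma g (diagonal covariance)
    scale = torch.sqrt(
        tau / (grad * sigma_grad).sum(-1, keepdim=True).clamp_min(1e-8)
    )
    return (mu + scale * sigma_grad).detach()           # improved action
```

The step requires no gradient-based policy optimization at deployment time: one critic gradient at the behavior mean plus the closed-form trust-region solution yields the improved action.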
