Paper Title

Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

Paper Authors

Yan Li, Guanghui Lan, Tuo Zhao

Paper Abstract

We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap, and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with the maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of these computational properties. As a byproduct, we discover the finite-time exact convergence for some commonly used Bregman divergences, implying the continuing convergence of HPMD to the limiting policy even if the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for small optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / \varepsilon^2)$ sample complexity holds with high probability, when assuming a generative model for policy evaluation.
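To make the kind of update the abstract refers to concrete, a generic mirror descent type policy update with a diminishing regularization term, instantiated with the Kullback-Leibler divergence, can be sketched as below. The step size $\eta_k$, the regularization weight $\tau_k$, and the reference policy $\pi_0$ are illustrative placeholders rather than the exact schedule analyzed in the paper, and the update is written in a cost-minimization convention (for reward maximization the sign of the $Q$ term flips):

$$
\pi_{k+1}(\cdot \mid s) \in \operatorname*{argmin}_{p \in \Delta(\mathcal{A})}
\Big\{ \eta_k \big\langle Q^{\pi_k}(s,\cdot),\, p \big\rangle
+ \eta_k \tau_k\, \mathrm{KL}\!\left(p \,\middle\|\, \pi_0(\cdot \mid s)\right)
+ \mathrm{KL}\!\left(p \,\middle\|\, \pi_k(\cdot \mid s)\right) \Big\},
\qquad \tau_k \downarrow 0 .
$$

With the KL divergence this minimization admits a closed-form Gibbs (softmax) solution over the actions, so each iteration only requires the current action values $Q^{\pi_k}(s,\cdot)$; replacing the two KL terms with a decomposable Bregman divergence corresponds to the generalized instantiation discussed in the abstract.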
