Paper Title

Dimensionality Reduction and Prioritized Exploration for Policy Search

Authors

Marius Memmel, Puze Liu, Davide Tateo, Jan Peters

Abstract


Black-box policy optimization is a class of reinforcement learning algorithms that explores and updates the policy at the parameter level. This class of algorithms is widely applied in robotics with movement primitives or non-differentiable policies. Furthermore, these approaches are particularly relevant where exploration at the action level could cause actuator damage or other safety issues. However, black-box optimization does not scale well with increasing dimensionality of the policy, leading to a high demand for samples, which are expensive to obtain in real-world systems. In many practical applications, policy parameters do not contribute equally to the return. Identifying the most relevant parameters allows narrowing down the exploration and speeding up learning. Furthermore, updating only the effective parameters requires fewer samples, improving the scalability of the method. We present a novel method to prioritize the exploration of effective parameters and cope with full covariance matrix updates. Our algorithm learns faster than recent approaches and requires fewer samples to achieve state-of-the-art results. To select the effective parameters, we consider both the Pearson correlation coefficient and mutual information. We showcase the capabilities of our approach on the Relative Entropy Policy Search algorithm in several simulated environments, including robotics simulations. Code is available at https://git.ias.informatik.tu-darmstadt.de/ias_code/aistats2022/dr-creps.
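The parameter-selection idea in the abstract — ranking policy parameters by how strongly they relate to the return — can be sketched with the Pearson correlation coefficient. This is a minimal NumPy illustration, not the paper's exact procedure; the function name, the top-k selection rule, and the toy data are assumptions for the example only:

```python
import numpy as np

def select_effective_params(theta_samples, returns, k):
    """Rank policy parameters by |Pearson correlation| with the return
    and keep the indices of the k most strongly correlated ones.
    (Illustrative sketch; the real method also considers mutual information.)"""
    # theta_samples: (n_rollouts, n_params); returns: (n_rollouts,)
    theta_c = theta_samples - theta_samples.mean(axis=0)
    ret_c = returns - returns.mean()
    cov = theta_c.T @ ret_c / len(returns)
    denom = theta_samples.std(axis=0) * returns.std()
    corr = np.where(denom > 0, cov / np.where(denom > 0, denom, 1.0), 0.0)
    return np.argsort(-np.abs(corr))[:k]

rng = np.random.default_rng(0)
thetas = rng.normal(size=(200, 10))
# In this toy problem, only parameters 0 and 3 actually influence the return.
rets = 2.0 * thetas[:, 0] - 1.5 * thetas[:, 3] + 0.1 * rng.normal(size=200)
print(sorted(select_effective_params(thetas, rets, 2)))  # → [0, 3]
```

A mutual-information criterion (e.g. a k-nearest-neighbor estimator) could be substituted for the correlation score in the same ranking loop, which is useful when the parameter-return relationship is nonlinear.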
