Paper Title

Instance based Generalization in Reinforcement Learning

Paper Authors

Martin Bertran, Natalia Martinez, Mariano Phielipp, Guillermo Sapiro

Paper Abstract

Agents trained via deep reinforcement learning (RL) routinely fail to generalize to unseen environments, even when these share the same underlying dynamics as the training levels. Understanding the generalization properties of RL is one of the challenges of modern machine learning. Towards this goal, we analyze policy learning in the context of Partially Observable Markov Decision Processes (POMDPs) and formalize the dynamics of training levels as instances. We prove that, independently of the exploration strategy, reusing instances introduces significant changes in the effective Markov dynamics the agent observes during training. Maximizing expected rewards impacts the learned belief state of the agent by inducing undesired instance-specific speedrunning policies instead of generalizable ones, which are suboptimal on the training set. We provide generalization bounds on the value gap between train and test environments based on the number of training instances, and use insights based on these to improve performance on unseen levels. We propose training a shared belief representation over an ensemble of specialized policies, from which we compute a consensus policy that is used for data collection, disallowing instance-specific exploitation. We experimentally validate our theory, observations, and the proposed computational solution over the CoinRun benchmark.
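
The ensemble-plus-consensus idea in the abstract can be pictured with a minimal sketch: a shared belief encoder feeds one specialized policy head per training instance, and a consensus policy, used for data collection, averages the heads' action distributions so that no single instance-specific policy is exploited. The PyTorch snippet below is an illustrative assumption, not the authors' implementation; the class name, layer sizes, and the simple averaging rule are all hypothetical.

import torch
import torch.nn as nn

class EnsemblePolicy(nn.Module):
    """Shared belief encoder with one policy head per training instance (sketch)."""
    def __init__(self, obs_dim, belief_dim, n_actions, n_instances):
        super().__init__()
        # Shared belief representation trained jointly across all heads.
        self.belief = nn.Sequential(
            nn.Linear(obs_dim, belief_dim), nn.ReLU(),
            nn.Linear(belief_dim, belief_dim), nn.ReLU(),
        )
        # One specialized policy head per training instance.
        self.heads = nn.ModuleList(
            nn.Linear(belief_dim, n_actions) for _ in range(n_instances)
        )

    def consensus(self, obs):
        """Average the instance-specific action distributions into a consensus policy."""
        b = self.belief(obs)
        probs = torch.stack(
            [torch.softmax(head(b), dim=-1) for head in self.heads]
        )  # shape: (n_instances, batch, n_actions)
        return probs.mean(dim=0)  # consensus action distribution

# Example: sample data-collection actions from the consensus policy
# (all dimensions below are placeholder values).
model = EnsemblePolicy(obs_dim=64, belief_dim=128, n_actions=5, n_instances=8)
obs = torch.randn(16, 64)  # a batch of flattened observations
actions = torch.distributions.Categorical(model.consensus(obs)).sample()

Averaging the heads is one plausible way to realize a "consensus" that masks instance-specific behavior during data collection; the paper itself should be consulted for the exact construction.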
