Paper Title

Learning Non-Markovian Reward Models in MDPs

Paper Authors

Gavin Rens, Jean-François Raskin

Paper Abstract

There are situations in which an agent should receive rewards only after having accomplished a series of previous tasks. In other words, the reward that the agent receives is non-Markovian. One natural and quite general way to represent history-dependent rewards is via a Mealy machine: a finite state automaton that produces output sequences (rewards, in our case) from input sequences (state/action observations, in our case). In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves, and a Mealy machine synchronised with this MDP to formalise the non-Markovian reward function. While the MDP is known by the agent, the reward function is unknown to the agent and must be learnt. Learning non-Markovian reward functions is a challenge. Our approach to overcoming this challenging problem is a careful combination of Angluin's L* active learning algorithm for learning finite automata, testing techniques for establishing conformance of the finite-model hypothesis, and optimisation techniques for computing optimal strategies in Markovian (immediate) reward MDPs. We also show how our framework can be combined with classical heuristics such as Monte Carlo Tree Search. We illustrate our algorithms and a preliminary implementation on two typical AI examples.
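To make the Mealy-machine view of non-Markovian rewards concrete, the following minimal Python sketch (not taken from the paper; the class, the observation alphabet and the example machine are illustrative assumptions) shows a finite-state transducer that reads state/action observations and emits rewards, so the reward depends on the whole observation history through the machine's internal node rather than on the current MDP state alone.

```python
# Minimal sketch of a Mealy-machine reward model (illustrative, not the
# authors' implementation): a finite-state transducer that maps an input
# sequence of state/action observations to an output sequence of rewards.

from dataclasses import dataclass, field
from typing import Dict, Hashable, Tuple

Obs = Hashable   # a state/action observation fed to the machine
Node = Hashable  # an internal node of the Mealy machine


@dataclass
class MealyRewardMachine:
    """Reads observations one at a time and outputs history-dependent rewards."""
    initial: Node
    # (current node, observation) -> (next node, emitted reward)
    delta: Dict[Tuple[Node, Obs], Tuple[Node, float]]
    node: Node = field(init=False)

    def __post_init__(self) -> None:
        self.node = self.initial

    def reset(self) -> None:
        self.node = self.initial

    def step(self, obs: Obs) -> float:
        """Advance the machine on one observation and return the reward it emits."""
        self.node, reward = self.delta[(self.node, obs)]
        return reward


# Illustrative machine: the agent is rewarded for reaching "goal" only after
# having first observed "key" -- a reward that no Markovian reward function
# over MDP states alone can express.
machine = MealyRewardMachine(
    initial="no_key",
    delta={
        ("no_key", "key"): ("has_key", 0.0),
        ("no_key", "goal"): ("no_key", 0.0),
        ("no_key", "other"): ("no_key", 0.0),
        ("has_key", "key"): ("has_key", 0.0),
        ("has_key", "goal"): ("done", 1.0),   # reward only with the right history
        ("has_key", "other"): ("has_key", 0.0),
        ("done", "key"): ("done", 0.0),
        ("done", "goal"): ("done", 0.0),
        ("done", "other"): ("done", 0.0),
    },
)

machine.reset()
print([machine.step(o) for o in ["goal", "key", "goal"]])  # [0.0, 0.0, 1.0]
```

Once such a hypothesis machine is fixed (in the paper it is learnt with Angluin's L* from observed rewards and checked by conformance testing), synchronising it with the known MDP yields a product MDP whose rewards are Markovian (immediate), to which standard optimisation techniques for computing optimal strategies can then be applied.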
