Paper Title
Under-Approximating Expected Total Rewards in POMDPs
Paper Authors
Paper Abstract
We consider the problem: is the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP) below a given threshold? We tackle this -- generally undecidable -- problem by computing under-approximations on these total expected rewards. This is done by abstracting finite unfoldings of the infinite belief MDP of the POMDP. The key issue is to find a suitable under-approximation of the value function. We provide two techniques: a simple (cut-off) technique that uses a good policy on the POMDP, and a more advanced technique (belief clipping) that uses minimal shifts of probabilities between beliefs. We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well while providing tight lower bounds on the expected total reward.
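To make the cut-off technique mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it unfolds the belief MDP of a tiny hypothetical POMDP up to a fixed depth and, at cut-off beliefs, falls back to the trivial lower bound 0; the value of any fixed POMDP policy could be plugged in there instead to tighten the bound. The MILP-based belief clipping is not reproduced. The toy model, all names, and the depth parameter are illustrative assumptions.

```python
# A minimal sketch of the cut-off idea, not the authors' implementation:
# unfold the belief MDP of a tiny toy POMDP up to a fixed depth and, at
# cut-off beliefs, fall back to a trivially sound under-approximation
# (here: 0, i.e. "collect no further reward"). The value of any fixed
# POMDP policy could be substituted there to tighten the bound.
# The toy model, all names, and the depth bound are illustrative assumptions.

# Toy POMDP: transition probabilities P[s][a] = {s': prob}, observation
# function O[s], nonnegative rewards R[s][a], and an absorbing goal state.
ACTIONS = ["a", "b"]
P = {
    "s0":   {"a": {"s1": 0.6, "s0": 0.4}, "b": {"goal": 0.1, "s0": 0.9}},
    "s1":   {"a": {"goal": 0.8, "s1": 0.2}, "b": {"s0": 1.0}},
    "goal": {"a": {"goal": 1.0}, "b": {"goal": 1.0}},
}
O = {"s0": "o", "s1": "o", "goal": "done"}   # s0 and s1 are indistinguishable
R = {"s0": {"a": 1.0, "b": 0.0},
     "s1": {"a": 2.0, "b": 0.0},
     "goal": {"a": 0.0, "b": 0.0}}

def belief_update(belief, action, obs):
    """Bayes update of a belief (dict: state -> prob) after `action`/`obs`."""
    new = {}
    for s, p in belief.items():
        for s2, q in P[s][action].items():
            if O[s2] == obs:
                new[s2] = new.get(s2, 0.0) + p * q
    z = sum(new.values())
    return {s: p / z for s, p in new.items()}

def lower_bound(belief, depth):
    """Under-approximation of the optimal expected total reward from `belief`:
    explore the belief MDP for `depth` steps, then cut off with value 0."""
    if all(s == "goal" for s in belief) or depth == 0:
        return 0.0   # cut-off is sound here because all rewards are >= 0
    best = 0.0
    for a in ACTIONS:
        val = sum(p * R[s][a] for s, p in belief.items())  # expected reward
        obs_prob = {}                                       # P(obs | belief, a)
        for s, p in belief.items():
            for s2, q in P[s][a].items():
                obs_prob[O[s2]] = obs_prob.get(O[s2], 0.0) + p * q
        for obs, pr in obs_prob.items():
            if pr > 0.0:
                val += pr * lower_bound(belief_update(belief, a, obs), depth - 1)
        best = max(best, val)
    return best

if __name__ == "__main__":
    init = {"s0": 1.0}
    for k in (1, 2, 4, 8):
        print(f"depth {k:2d}: lower bound = {lower_bound(init, k):.4f}")
```

Deeper unfoldings yield monotonically tighter lower bounds; the paper's techniques improve on this naive scheme by using a good POMDP policy at cut-off beliefs and by clipping beliefs onto nearby ones via minimal probability shifts computed with an MILP.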