Paper Title
Under-Approximating Expected Total Rewards in POMDPs
Paper Authors
Paper Abstract
We consider the problem: is the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP) below a given threshold? We tackle this -- generally undecidable -- problem by computing under-approximations on these total expected rewards. This is done by abstracting finite unfoldings of the infinite belief MDP of the POMDP. The key issue is to find a suitable under-approximation of the value function. We provide two techniques: a simple (cut-off) technique that uses a good policy on the POMDP, and a more advanced technique (belief clipping) that uses minimal shifts of probabilities between beliefs. We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well while providing tight lower bounds on the expected total reward.
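To make the cut-off technique mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it unfolds the belief MDP of a tiny hypothetical POMDP up to a fixed depth and, at cut-off beliefs, falls back to the trivial lower bound 0; the value of any fixed POMDP policy could be plugged in there instead to tighten the bound. The MILP-based belief clipping is not reproduced. The toy model, all names, and the depth parameter are illustrative assumptions.

```python
# A minimal sketch of the cut-off idea, not the authors' implementation:
# unfold the belief MDP of a tiny toy POMDP up to a fixed depth and, at
# cut-off beliefs, fall back to a trivially sound under-approximation
# (here: 0, i.e. "collect no further reward"). The value of any fixed
# POMDP policy could be substituted there to tighten the bound.
# The toy model, all names, and the depth bound are illustrative assumptions.

# Toy POMDP: transition probabilities P[s][a] = {s': prob}, observation
# function O[s], nonnegative rewards R[s][a], and an absorbing goal state.
ACTIONS = ["a", "b"]
P = {
    "s0":   {"a": {"s1": 0.6, "s0": 0.4}, "b": {"goal": 0.1, "s0": 0.9}},
    "s1":   {"a": {"goal": 0.8, "s1": 0.2}, "b": {"s0": 1.0}},
    "goal": {"a": {"goal": 1.0}, "b": {"goal": 1.0}},
}
O = {"s0": "o", "s1": "o", "goal": "done"}   # s0 and s1 are indistinguishable
R = {"s0": {"a": 1.0, "b": 0.0},
     "s1": {"a": 2.0, "b": 0.0},
     "goal": {"a": 0.0, "b": 0.0}}

def belief_update(belief, action, obs):
    """Bayes update of a belief (dict: state -> prob) after `action`/`obs`."""
    new = {}
    for s, p in belief.items():
        for s2, q in P[s][action].items():
            if O[s2] == obs:
                new[s2] = new.get(s2, 0.0) + p * q
    z = sum(new.values())
    return {s: p / z for s, p in new.items()}

def lower_bound(belief, depth):
    """Under-approximation of the optimal expected total reward from `belief`:
    explore the belief MDP for `depth` steps, then cut off with value 0."""
    if all(s == "goal" for s in belief) or depth == 0:
        return 0.0   # cut-off is sound here because all rewards are >= 0
    best = 0.0
    for a in ACTIONS:
        val = sum(p * R[s][a] for s, p in belief.items())  # expected reward
        obs_prob = {}                                       # P(obs | belief, a)
        for s, p in belief.items():
            for s2, q in P[s][a].items():
                obs_prob[O[s2]] = obs_prob.get(O[s2], 0.0) + p * q
        for obs, pr in obs_prob.items():
            if pr > 0.0:
                val += pr * lower_bound(belief_update(belief, a, obs), depth - 1)
        best = max(best, val)
    return best

if __name__ == "__main__":
    init = {"s0": 1.0}
    for k in (1, 2, 4, 8):
        print(f"depth {k:2d}: lower bound = {lower_bound(init, k):.4f}")
```

Deeper unfoldings yield monotonically tighter lower bounds; the paper's techniques improve on this naive scheme by using a good POMDP policy at cut-off beliefs and by clipping beliefs onto nearby ones via minimal probability shifts computed with an MILP.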