Paper Title
Discover Life Skills for Planning with Bandits via Observing and Learning How the World Works
Authors
Abstract
We propose a novel approach that lets planning agents compose abstract skills by observing and learning from their historical interactions with the world. Our framework operates in a Markov state-space model with a set of actions whose preconditions are unknown. We formulate skills as high-level abstract policies that propose action plans based on the current state. Each policy learns new plans by observing state transitions while the agent interacts with the world. This approach automatically learns new plans that achieve specific intended effects, but the success of a plan often depends on the states in which it is applied. We therefore formulate the evaluation of such plans as infinitely many multi-armed bandit problems, in which we balance the allocation of resources between estimating the success probability of existing arms and exploring new options. The result is a planner that automatically learns robust high-level skills in a noisy environment; these skills implicitly learn the action preconditions without explicit knowledge of them. In experiments, this planning approach is very competitive in high-dimensional state-space domains.
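The trade-off between evaluating existing arms and exploring new options can be illustrated with a small sketch. This is not the authors' algorithm: the class `PlanBandit`, the square-root rule for admitting new arms, the UCB1 score, and the uniformly drawn success probabilities in `simulate` are all assumptions chosen for illustration. Each arm stands for a candidate plan whose success probability is unknown.

```python
import math
import random


class PlanBandit:
    """Success statistics for candidate plans (arms); new arms can always be added."""

    def __init__(self):
        self.successes = []  # successes observed per arm
        self.trials = []     # number of times each arm was tried

    def add_arm(self):
        """Register a new, untried plan and return its index."""
        self.successes.append(0)
        self.trials.append(0)
        return len(self.trials) - 1

    def should_explore_new(self):
        # Assumed rule: keep roughly sqrt(total trials) arms in play,
        # so new options are admitted ever more slowly over time.
        total = sum(self.trials)
        return len(self.trials) < math.sqrt(total + 1)

    def select(self):
        """Return the arm to try next, creating a new arm if warranted."""
        if self.should_explore_new():
            return self.add_arm()
        total = sum(self.trials)

        def ucb(i):
            if self.trials[i] == 0:
                return float("inf")  # untried arms get priority
            mean = self.successes[i] / self.trials[i]
            bonus = math.sqrt(2 * math.log(total) / self.trials[i])
            return mean + bonus

        return max(range(len(self.trials)), key=ucb)

    def update(self, arm, success):
        """Record the outcome of executing the plan behind `arm`."""
        self.trials[arm] += 1
        self.successes[arm] += int(success)


def simulate(rounds, seed=0):
    """Run the bandit against hypothetical plans with hidden success rates."""
    rng = random.Random(seed)
    bandit = PlanBandit()
    hidden_probs = []  # assumed: each new plan's success rate is uniform in [0, 1]
    for _ in range(rounds):
        arm = bandit.select()
        if arm == len(hidden_probs):
            hidden_probs.append(rng.random())
        bandit.update(arm, rng.random() < hidden_probs[arm])
    return bandit
```

Calling `simulate(200)` returns a `PlanBandit` whose `trials` list shows how the 200 evaluations were spread across the arms admitted so far. A state-dependent variant would keep one such bandit per state or state abstraction, which is closer in spirit to the many bandit problems the abstract describes.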