World Value Functions: Knowledge Representation for Multitask Reinforcement Learning

Paper title

World Value Functions: Knowledge Representation for Multitask Reinforcement Learning

Authors

Geraud Nangue Tasse, Steven James, Benjamin Rosman

Abstract

An open problem in artificial intelligence is how to learn and represent knowledge that is sufficient for a general agent that needs to solve multiple tasks in a given world. In this work we propose world value functions (WVFs), which are a type of general value function with mastery of the world - they represent not only how to solve a given task, but also how to solve any other goal-reaching task. To achieve this, we equip the agent with an internal goal space defined as all the world states where it experiences a terminal transition - a task outcome. The agent can then modify task rewards to define its own reward function, which provably drives it to learn how to achieve all achievable internal goals, and the value of doing so in the current task. We demonstrate a number of benefits of WVFs. When the agent's internal goal space is the entire state space, we demonstrate that the transition function can be inferred from the learned WVF, which allows the agent to plan using learned value functions. Additionally, we show that for tasks in the same world, a pretrained agent that has learned any WVF can then infer the policy and value function for any new task directly from its rewards. Finally, an important property for long-lived agents is the ability to reuse existing knowledge to solve new tasks. Using WVFs as the knowledge representation for learned tasks, we show that an agent is able to solve their logical combination zero-shot, resulting in a combinatorially increasing number of skills throughout their lifetime.
