Paper Title
Learning General World Models in a Handful of Reward-Free Deployments
Paper Authors
Paper Abstract
Building generally capable agents is a grand challenge for deep reinforcement learning (RL). To approach this challenge practically, we outline two key desiderata: 1) to facilitate generalization, exploration should be task agnostic; 2) to facilitate scalability, exploration policies should collect large quantities of data without costly centralized retraining. Combining these two properties, we introduce the reward-free deployment efficiency setting, a new paradigm for RL research. We then present CASCADE, a novel approach for self-supervised exploration in this new setting. CASCADE seeks to learn a world model by collecting data with a population of agents, using an information theoretic objective inspired by Bayesian Active Learning. CASCADE achieves this by specifically maximizing the diversity of trajectories sampled by the population through a novel cascading objective. We provide theoretical intuition for CASCADE which we show in a tabular setting improves upon naïve approaches that do not account for population diversity. We then demonstrate that CASCADE collects diverse task-agnostic datasets and learns agents that generalize zero-shot to novel, unseen downstream tasks on Atari, MiniGrid, Crafter and the DM Control Suite. Code and videos are available at https://ycxuyingchen.github.io/cascade/
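The cascading objective described above can be pictured as a greedy selection loop: each new exploration policy in the population is chosen for the information it is expected to add beyond the trajectories already covered by earlier members. The following is a minimal sketch under assumed interfaces, not the paper's implementation; the names candidate_feats, ensemble_disagreement and diversity_weight are hypothetical stand-ins for a candidate policy's trajectory summary, a world-model ensemble disagreement score, and a trade-off coefficient.

    # Minimal sketch of a greedy "cascading" population selection step.
    # Assumption: each candidate exploration policy is summarized by the
    # state-visitation features of the trajectories it would collect, and an
    # ensemble-disagreement score serves as an information-gain proxy.
    import numpy as np

    def cascade_select(candidate_feats, ensemble_disagreement,
                       population_size, diversity_weight=1.0):
        """Greedily pick a diverse population of exploration policies.

        candidate_feats: (n_candidates, d) trajectory features per candidate.
        ensemble_disagreement: (n_candidates,) information-gain proxy per candidate.
        Returns the indices of the selected population members.
        """
        selected = []
        for _ in range(population_size):
            best_idx, best_score = None, -np.inf
            for i in range(len(candidate_feats)):
                if i in selected:
                    continue
                # Information-gain term: how much the world model should learn
                # from this candidate's trajectories.
                info_gain = ensemble_disagreement[i]
                # Diversity term: distance to trajectories already claimed by
                # earlier members of the cascade.
                if selected:
                    diversity = min(np.linalg.norm(candidate_feats[i] - candidate_feats[j])
                                    for j in selected)
                else:
                    diversity = 0.0
                score = info_gain + diversity_weight * diversity
                if score > best_score:
                    best_idx, best_score = i, score
            selected.append(best_idx)
        return selected

    # Toy usage: 8 candidate policies with random features and disagreement scores.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 4))
    disagreement = rng.uniform(size=8)
    print(cascade_select(feats, disagreement, population_size=3))

The greedy structure is what makes the objective "cascading": each member is scored against the members chosen before it, so the population as a whole covers diverse trajectories rather than all converging on the single most informative behaviour.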