Paper Title

SEMI: Self-supervised Exploration via Multisensory Incongruity

Paper Authors

Jianren Wang, Ziwen Zhuang, Hang Zhao

Paper Abstract

Efficient exploration is a long-standing problem in reinforcement learning since extrinsic rewards are usually sparse or missing. A popular solution to this issue is to feed an agent with novelty signals as intrinsic rewards. In this work, we introduce SEMI, a self-supervised exploration policy by incentivizing the agent to maximize a new novelty signal: multisensory incongruity, which can be measured in two aspects, perception incongruity and action incongruity. The former represents the misalignment of the multisensory inputs, while the latter represents the variance of an agent's policies under different sensory inputs. Specifically, an alignment predictor is learned to detect whether multiple sensory inputs are aligned, the error of which is used to measure perception incongruity. A policy model takes different combinations of the multisensory observations as input and outputs actions for exploration. The variance of actions is further used to measure action incongruity. Using both incongruities as intrinsic rewards, SEMI allows an agent to learn skills by exploring in a self-supervised manner without any external rewards. We further show that SEMI is compatible with extrinsic rewards and it improves sample efficiency of policy learning. The effectiveness of SEMI is demonstrated across a variety of benchmark environments including object manipulation and audio-visual games.
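
To make the two incongruity signals concrete, below is a minimal PyTorch sketch of how such an intrinsic reward could be computed. It is an illustration of the idea described in the abstract, not the authors' implementation: all names (`AlignmentPredictor`, `MultisensoryPolicy`, `intrinsic_reward`), the network sizes, the modality-dropping scheme used to vary the sensory inputs, and the weighting `beta` are assumptions.

```python
# Illustrative sketch (not the authors' code) of SEMI-style intrinsic reward:
# perception incongruity = error of an alignment predictor on the observed pair;
# action incongruity = variance of the policy's actions across different
# combinations of sensory inputs.
import torch
import torch.nn as nn

class AlignmentPredictor(nn.Module):
    """Predicts whether two sensory streams (e.g. vision and audio/touch) are aligned."""
    def __init__(self, dim_a, dim_b, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # logit for P(aligned)

    def forward(self, feat_a, feat_b):
        return self.net(torch.cat([feat_a, feat_b], dim=-1)).squeeze(-1)

class MultisensoryPolicy(nn.Module):
    """Maps a (possibly partial) combination of sensory features to an action."""
    def __init__(self, dim_a, dim_b, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_a + dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, feat_a, feat_b):
        return self.net(torch.cat([feat_a, feat_b], dim=-1))

def intrinsic_reward(predictor, policy, feat_a, feat_b, beta=1.0):
    """Perception incongruity + beta * action incongruity (one possible weighting)."""
    # Perception incongruity: prediction error on inputs that are aligned by construction.
    logit = predictor(feat_a, feat_b)
    target = torch.ones_like(logit)
    r_perception = nn.functional.binary_cross_entropy_with_logits(
        logit, target, reduction="none")

    # Action incongruity: variance of actions under different sensory combinations,
    # here approximated by dropping each modality in turn (an assumption).
    actions = torch.stack([
        policy(feat_a, feat_b),
        policy(feat_a, torch.zeros_like(feat_b)),
        policy(torch.zeros_like(feat_a), feat_b),
    ], dim=0)
    r_action = actions.var(dim=0).mean(dim=-1)

    return r_perception + beta * r_action
```

In this sketch the combined signal would be added to (or used in place of) the environment reward when training the exploration policy, consistent with the abstract's claim that SEMI works both without external rewards and alongside them.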
