Paper Title

On the interaction between supervision and self-play in emergent communication

Authors

Ryan Lowe, Abhinav Gupta, Jakob Foerster, Douwe Kiela, Joelle Pineau

Abstract

A promising approach for teaching artificial agents to use natural language involves human-in-the-loop training. However, recent work suggests that current machine learning methods are too data-inefficient to be trained in this way from scratch. In this paper, we investigate the relationship between two categories of learning signals with the ultimate goal of improving sample efficiency: imitating human language data via supervised learning, and maximizing reward in a simulated multi-agent environment via self-play (as done in emergent communication). We introduce the term supervised self-play (S2P) for algorithms that use both of these signals. We find that first training agents via supervised learning on human data and then via self-play outperforms the converse, suggesting that it is not beneficial to emerge languages from scratch. We then empirically investigate various S2P schedules that begin with supervised learning in two environments: a Lewis signaling game with symbolic inputs, and an image-based referential game with natural language descriptions. Lastly, we introduce population-based approaches to S2P, which further improve performance over single-agent methods.
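
As a rough illustration of the S2P schedule the abstract describes, here is a minimal sketch in Python: a tabular Lewis signaling game in which a speaker and a listener are first trained by supervised learning to imitate a fixed "human" mapping, then fine-tuned jointly by self-play with REINFORCE. The tabular softmax policies, the permutation standing in for human language data, and all hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a supervised self-play (S2P) schedule on a tabular Lewis
# signaling game. The tabular policies, REINFORCE updates, the "human language"
# (a fixed permutation), and all hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 8                            # number of objects == number of symbols
human_lang = rng.permutation(N)  # stand-in for human data: object i -> symbol human_lang[i]

S = np.zeros((N, N))  # speaker logits: object -> symbol
L = np.zeros((N, N))  # listener logits: symbol -> object

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.5

# Phase 1: supervised learning, imitating the "human" mapping with cross-entropy.
for _ in range(200):
    obj = rng.integers(N)
    sym = human_lang[obj]
    # gradient of the log-likelihood for a softmax policy: one-hot(target) - probs
    S[obj] += lr * (np.eye(N)[sym] - softmax(S[obj]))
    L[sym] += lr * (np.eye(N)[obj] - softmax(L[sym]))

# Phase 2: self-play, maximizing task reward with REINFORCE on both agents.
for _ in range(2000):
    obj = rng.integers(N)
    p_s = softmax(S[obj]); sym = rng.choice(N, p=p_s)
    p_l = softmax(L[sym]); guess = rng.choice(N, p=p_l)
    r = 1.0 if guess == obj else 0.0  # reward: listener recovers the object
    S[obj] += lr * r * (np.eye(N)[sym] - p_s)
    L[sym] += lr * r * (np.eye(N)[guess] - p_l)

# Evaluate greedy communication accuracy after the S2P schedule.
acc = np.mean([softmax(L[np.argmax(S[o])]).argmax() == o for o in range(N)])
print(f"accuracy: {acc:.2f}")
```

Swapping the order of the two phases in this sketch gives the converse schedule (self-play first, then supervised learning) that the abstract reports performing worse.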
