Paper Title
OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts
Paper Authors
Paper Abstract
When humans converse, what a speaker will say next depends significantly on what they see. Unfortunately, existing dialogue models generate dialogue utterances based only on preceding textual contexts, and visual contexts are rarely considered. This is due to the lack of a large-scale multi-modal dialogue dataset with utterances paired with visual contexts. In this paper, we release {\bf OpenViDial}, a large-scale multi-modal dialogue dataset. The dialogue turns and visual contexts are extracted from movies and TV series, where each dialogue turn is paired with the corresponding visual context in which it takes place. OpenViDial contains a total of 1.1 million dialogue turns, and thus 1.1 million visual contexts stored as images. Based on this dataset, we propose a family of encoder-decoder models that leverage both textual and visual contexts, ranging from coarse-grained image features extracted by CNNs to fine-grained object features extracted by Faster R-CNN. We observe that visual information significantly improves dialogue generation quality, verifying the necessity of integrating multi-modal features for dialogue learning. Our work marks an important step towards large-scale multi-modal dialogue learning.
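The following is a minimal PyTorch sketch, not the authors' released code, of how such an encoder-decoder model can condition generation on both the preceding textual context and a coarse-grained CNN image feature of the visual context. The module names, dimensions, and the choice of a ResNet-50 backbone are illustrative assumptions.

```python
# Minimal sketch: an encoder-decoder dialogue model whose decoder attends over
# a joint memory built from (a) a pooled CNN feature of the visual context and
# (b) Transformer-encoded tokens of the preceding dialogue turns.
import torch
import torch.nn as nn
import torchvision.models as models


class VisualDialogueModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        # Coarse-grained visual feature: ResNet-50 with its classification
        # head removed, projected into the model dimension.
        resnet = models.resnet50()
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # (B, 2048, 1, 1)
        self.img_proj = nn.Linear(2048, d_model)

        # Textual context: token embeddings fed to a Transformer encoder.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers,
        )
        # Decoder cross-attends over the concatenated multi-modal memory.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image, context_tokens, target_tokens):
        # image: (B, 3, H, W); context_tokens / target_tokens: (B, T) token ids
        img_feat = self.cnn(image).flatten(1)           # (B, 2048)
        img_mem = self.img_proj(img_feat).unsqueeze(1)  # (B, 1, d_model)
        txt_mem = self.encoder(self.embed(context_tokens))  # (B, T_ctx, d_model)
        memory = torch.cat([img_mem, txt_mem], dim=1)   # joint multi-modal memory

        tgt = self.embed(target_tokens)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.decoder(tgt, memory, tgt_mask=tgt_mask)
        return self.out(dec)                            # (B, T_tgt, vocab_size)


if __name__ == "__main__":
    model = VisualDialogueModel(vocab_size=10000)
    logits = model(
        torch.randn(2, 3, 224, 224),        # visual context (images)
        torch.randint(0, 10000, (2, 16)),   # preceding dialogue turns
        torch.randint(0, 10000, (2, 12)),   # target utterance (teacher forcing)
    )
    print(logits.shape)  # torch.Size([2, 12, 10000])
```

A fine-grained variant along the lines described in the abstract would replace the single pooled image vector with a set of region features from an object detector such as Faster R-CNN, concatenated into the same decoder memory.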