Paper Title
Video-Grounded Dialogues with Pretrained Generation Language Models
Paper Authors
Paper Abstract
Pre-trained language models have shown remarkable success in improving various downstream NLP tasks due to their ability to capture dependencies in textual data and generate natural responses. In this paper, we leverage the power of pre-trained language models to improve video-grounded dialogue, which is highly challenging and involves complex features of different dynamics: (1) video features, which can extend across both spatial and temporal dimensions; and (2) dialogue features, which involve semantic dependencies over multiple dialogue turns. We propose a framework that extends GPT-2 models to tackle these challenges by formulating video-grounded dialogue as a sequence-to-sequence task, combining visual and textual representations into a structured sequence, and fine-tuning a large pre-trained GPT-2 network. Our framework allows fine-tuned language models to capture dependencies across multiple modalities and over different levels of information: the spatio-temporal level in video and the token-sentence level in the dialogue context. We achieve promising improvements on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from DSTC7, which supports a potential direction in this line of research.
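
To make the structured-sequence idea concrete, below is a minimal sketch (not the authors' released implementation) of how pre-extracted video features and dialogue tokens might be combined into a single embedding sequence for a pre-trained GPT-2, using the Hugging Face transformers library. The I3D-style feature shape, the linear projection layer, and the sample dialogue text are illustrative assumptions.

# Minimal sketch, assuming pre-extracted video features (e.g., I3D clip
# features) and a standard GPT-2 from Hugging Face transformers. The
# structured sequence is [video segments ; dialogue tokens], fed to the
# model via inputs_embeds so GPT-2's vocabulary stays unchanged.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical video features: 16 temporal segments of 2048-dim vectors.
video_feats = torch.randn(16, 2048)
# Project video features into GPT-2's embedding space (768-dim for "gpt2").
video_proj = nn.Linear(2048, model.config.n_embd)
video_embeds = video_proj(video_feats).unsqueeze(0)  # (1, 16, 768)

# Encode the dialogue history plus the current question as ordinary tokens.
text = "Q: what is the man doing? A: he is cooking. Q: where is he? A:"
token_ids = tokenizer(text, return_tensors="pt").input_ids
text_embeds = model.transformer.wte(token_ids)  # (1, seq_len, 768)

# Concatenate both modalities into one structured input sequence.
inputs_embeds = torch.cat([video_embeds, text_embeds], dim=1)
outputs = model(inputs_embeds=inputs_embeds)

# Next-token logits at the last position can seed response decoding;
# during fine-tuning, the LM loss would be applied on response tokens only.
next_token_logits = outputs.logits[:, -1, :]
print(next_token_logits.shape)  # torch.Size([1, 50257])

Feeding the fused sequence through inputs_embeds is one common way to let a text-only pre-trained model attend jointly over visual and textual positions without changing its tokenizer or vocabulary; whether this matches the paper's exact input construction is an assumption of this sketch.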