Guessing State Tracking for Visual Dialogue

Paper Title

Guessing State Tracking for Visual Dialogue

Authors

Wei Pang, Xiaojie Wang

Abstract

The Guesser performs visual grounding in GuessWhat?!-like visual dialogue: it locates the target object in an image, chosen by the Oracle, through a question-answer dialogue between a Questioner and the Oracle. Most existing guessers make a single guess only after receiving all question-answer pairs in a dialogue with a predefined number of rounds. This paper proposes a guessing state for the Guesser and treats guessing as a process in which the guessing state changes over the course of the dialogue. A guess model based on guessing state tracking is therefore proposed. The guessing state is defined as a distribution over the objects in the image. With that in hand, two loss functions are defined as supervision for model training: early supervision supervises the Guesser at early rounds, and incremental supervision brings monotonicity to the guessing state. Experimental results on the GuessWhat?! dataset show that the model significantly outperforms previous models and achieves a new state of the art; in particular, its guessing success rate of 83.3% approaches the human-level accuracy of 84.4%.
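The abstract gives no formulas, so the following is only a minimal sketch of the idea, not the authors' formulation. It assumes the guessing state at each round is a softmax over per-object scores, uses a per-round negative log-likelihood as a stand-in for early supervision, and uses a hinge penalty on drops in the target's probability as a stand-in for incremental supervision. All function names and the example scores are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-object scores into a distribution (a guessing state)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_supervision_loss(states, target):
    """Assumed stand-in: mean negative log-probability of the target
    object over every round, so early rounds are supervised too."""
    return sum(-math.log(state[target]) for state in states) / len(states)


def incremental_supervision_loss(states, target):
    """Assumed stand-in: sum of the drops in the target's probability
    between consecutive rounds; zero when it never decreases."""
    return sum(
        max(0.0, prev[target] - cur[target])
        for prev, cur in zip(states, states[1:])
    )


# Hypothetical scores for 3 candidate objects over 3 dialogue rounds.
round_scores = [[0.0, 0.0, 0.0], [0.0, 0.5, 1.0], [0.0, 0.0, 3.0]]
target = 2

# Track the guessing state round by round; guess from the final state.
states = [softmax(scores) for scores in round_scores]
guess = max(range(len(states[-1])), key=lambda i: states[-1][i])
```

In this toy run the target's probability rises at every round, so the incremental term is zero and only the per-round likelihood term contributes to the loss.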
