REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering

Paper title

REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering

Authors

Siwen Luo, Soyeon Caren Han, Kaiyuan Sun, Josiah Poon

Abstract

Visual question answering (VQA) is a challenging multi-modal task that requires not only a semantic understanding of both images and questions, but also a sound grasp of the step-by-step reasoning process that leads to the correct answer. So far, the most successful attempts in VQA have focused on only one aspect: either the interaction between the visual pixel features of images and the word features of questions, or the reasoning process of answering a question about an image with simple objects. In this paper, we propose a deep reasoning VQA model with explicit visual structure-aware textual information, which works well at capturing the step-by-step reasoning process and detecting complex object relationships in photo-realistic images. The REXUP network consists of two branches, one image object-oriented and one scene-graph-oriented, which work jointly with a super-diagonal fusion compositional attention network. We quantitatively and qualitatively evaluate REXUP on the GQA dataset and conduct extensive ablation studies to explore the reasons behind REXUP's effectiveness. Our best model significantly outperforms the previous state of the art, achieving 92.7% on the validation set and 73.1% on the test-dev set.
