VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Paper title

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Authors

Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec

Abstract

Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.
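The super-node idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the node names, the two-dimensional features, and the use of plain mean aggregation in place of the paper's multimodal GNN are all assumptions made for illustration. The sketch only shows how a single QA-context node linked to both graphs lets information flow in both directions between them.

```python
from collections import defaultdict


def build_joint_graph(scene_nodes, scene_edges, concept_nodes, concept_edges,
                      super_node="qa_context"):
    """Join two graphs through one super node (hypothetical helper).

    Returns an undirected adjacency map. The super node is linked to every
    scene node and every concept node, so it is the only bridge between them.
    """
    adj = defaultdict(set)
    for u, v in list(scene_edges) + list(concept_edges):
        adj[u].add(v)
        adj[v].add(u)
    for node in list(scene_nodes) + list(concept_nodes):
        adj[super_node].add(node)
        adj[node].add(super_node)
    return adj


def message_passing_step(adj, features):
    """One round of mean aggregation (an assumed stand-in for the paper's GNN).

    Each node's new feature is the average of its own feature and the
    features of its neighbours.
    """
    updated = {}
    for node, feat in features.items():
        group = [feat] + [features[n] for n in adj[node]]
        updated[node] = [
            sum(vec[i] for vec in group) / len(group) for i in range(len(feat))
        ]
    return updated


# Toy data: feature = [scene signal, concept signal].
scene_nodes = ["person", "cup"]
concept_nodes = ["drink", "thirsty"]
adj = build_joint_graph(
    scene_nodes, [("person", "cup")],
    concept_nodes, [("drink", "thirsty")],
)
h0 = {
    "person": [1.0, 0.0], "cup": [1.0, 0.0],
    "drink": [0.0, 1.0], "thirsty": [0.0, 1.0],
    "qa_context": [0.0, 0.0],
}
h1 = message_passing_step(adj, h0)  # super node now mixes both modalities
h2 = message_passing_step(adj, h1)  # each graph now sees the other's signal
```

After the first step only the super node carries both signals; after the second step the scene nodes hold a nonzero concept signal and the concept nodes hold a nonzero scene signal. That two-way flow through the QA-context node is the behaviour the abstract describes as bidirectional fusion, shown here in its simplest form.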
