Paper Title

Spatially Aware Multimodal Transformers for TextVQA

Paper Authors

Yash Kant, Dhruv Batra, Peter Anderson, Alex Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal

Paper Abstract

Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing the attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline, and 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.
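To make the mechanism concrete, below is a minimal sketch (not the authors' released code) of a spatially aware multi-head self-attention layer in PyTorch: each head is restricted, via an attention mask, to entity pairs connected by its assigned subset of spatial relations from the spatial graph. The class name, tensor shapes, the `rel_graph` adjacency format, and the modulo relation-to-head assignment are illustrative assumptions.

```python
# Sketch of spatially aware multi-head self-attention: each head attends
# only over entity pairs linked by its assigned spatial relations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiallyAwareSelfAttention(nn.Module):
    def __init__(self, hidden_dim=768, num_heads=12, num_relations=12):
        super().__init__()
        assert hidden_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.qkv = nn.Linear(hidden_dim, 3 * hidden_dim)
        self.out = nn.Linear(hidden_dim, hidden_dim)
        # Assumed assignment: relation r is routed to head r % num_heads,
        # so each head sees a different subset of spatial relations.
        self.register_buffer("rel_to_head",
                             torch.arange(num_relations) % num_heads)

    def forward(self, x, rel_graph):
        # x:         (batch, num_entities, hidden_dim) visual/OCR features
        # rel_graph: (batch, num_relations, num_entities, num_entities) bool,
        #            True where the spatial relation holds between entities
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Per-head mask: head h may attend i -> j only if some relation
        # assigned to h connects i and j (self-loops are always allowed).
        head_mask = torch.zeros(B, self.num_heads, N, N,
                                dtype=torch.bool, device=x.device)
        for r in range(rel_graph.size(1)):
            h = int(self.rel_to_head[r])
            head_mask[:, h] |= rel_graph[:, r]
        head_mask |= torch.eye(N, dtype=torch.bool, device=x.device)

        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        scores = scores.masked_fill(~head_mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

The masking confines each head to local context defined by the spatial graph instead of spreading attention over all visual entities, while different heads cover different relation subsets, which is the intuition behind the two advantages stated in the abstract.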
