Paper Title

A survey on knowledge-enhanced multimodal learning

Paper Authors

Maria Lymperaiou, Giorgos Stamou

Paper Abstract

Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects calls the extensibility of VL tasks into question. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance the explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
