Paper Title
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Paper Authors
Paper Abstract
Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty, which includes both inter- and intra-modal uncertainty, is problematic for interpretation. Little effort has been devoted to modeling this uncertainty, particularly during pre-training on unlabeled datasets and fine-tuning on task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.
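The core idea of projecting a deterministic feature into a probability distribution can be illustrated with a minimal sketch. Assumptions: the abstract does not specify the PDE's internal architecture or the similarity measure, so the linear heads (`W_mu`, `W_sigma`), the diagonal-Gaussian parameterization, and the 2-Wasserstein distance below are hypothetical illustrative choices, not the paper's exact method.

```python
import numpy as np

def pde_project(x, W_mu, W_sigma):
    """Map a deterministic feature vector to a diagonal Gaussian,
    parameterized by (mean, log-variance). A simplified stand-in for
    the paper's Probability Distribution Encoder; the two linear
    heads here are hypothetical."""
    mu = x @ W_mu
    log_var = x @ W_sigma
    return mu, log_var

def wasserstein2_sq(mu1, log_var1, mu2, log_var2):
    """Squared 2-Wasserstein distance between two diagonal Gaussians,
    one common choice of similarity for distribution-based
    contrastive objectives (an assumption, not taken from the paper):
        ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2
    """
    s1 = np.exp(0.5 * log_var1)
    s2 = np.exp(0.5 * log_var2)
    return float(np.sum((mu1 - mu2) ** 2) + np.sum((s1 - s2) ** 2))

# Toy usage: compare an "image" and a "text" feature as distributions.
rng = np.random.default_rng(0)
d = 8
W_mu = rng.normal(size=(d, d))
W_sigma = rng.normal(size=(d, d))
img_feat = rng.normal(size=d)
txt_feat = rng.normal(size=d)

mu_i, lv_i = pde_project(img_feat, W_mu, W_sigma)
mu_t, lv_t = pde_project(txt_feat, W_mu, W_sigma)
dist = wasserstein2_sq(mu_i, lv_i, mu_t, lv_t)
print(dist >= 0.0)  # a valid distance is non-negative
```

In a contrastive setup such as D-VLC, distances like this for matched image-text pairs would be pushed below those of mismatched pairs; the learned variance lets an ambiguous input (one referring to multiple targets) occupy a broader region of the embedding space instead of a single point.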