Paper Title
What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge
Paper Authors
Paper Abstract
There are limitations in learning language from text alone. Therefore, recent work has focused on developing multimodal models. However, few benchmarks exist that can measure what language models learn about language from multimodal training. We hypothesize that training on a visual modality should improve the visual commonsense knowledge of language models. We therefore introduce two evaluation tasks for measuring visual commonsense knowledge in language models and use them to evaluate different multimodal models and unimodal baselines. Primarily, we find that visual commonsense knowledge does not differ significantly between the multimodal models and unimodal baseline models trained on visual text data.