Paper Title

Grounding Visual Representations with Texts for Domain Generalization

Paper Authors

Seonwoo Min, Nokyung Park, Siwon Kim, Seunghyun Park, Jinkyu Kim

Paper Abstract

Reducing the representational discrepancy between source and target domains is a key component to maximize the model generalization. In this work, we advocate for leveraging natural language supervision for the domain generalization task. We introduce two modules to ground visual representations with texts containing typical reasoning of humans: (1) Visual and Textual Joint Embedder and (2) Textual Explanation Generator. The former learns the image-text joint embedding space where we can ground high-level class-discriminative information into the model. The latter leverages an explainable model and generates explanations justifying the rationale behind its decision. To the best of our knowledge, this is the first work to leverage the vision-and-language cross-modality approach for the domain generalization task. Our experiments with a newly created CUB-DG benchmark dataset demonstrate that cross-modality supervision can be successfully used to ground domain-invariant visual representations and improve the model generalization. Furthermore, in the large-scale DomainBed benchmark, our proposed method achieves state-of-the-art results and ranks 1st in average performance for five multi-domain datasets. The dataset and codes are available at https://github.com/mswzeus/GVRT.
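To illustrate the idea of a visual-and-textual joint embedder described in the abstract, the snippet below is a minimal, hypothetical sketch of projecting image and text features into a shared embedding space and aligning paired embeddings. It is not the authors' GVRT implementation; the module name `JointEmbedder`, the feature dimensions, and the cosine-alignment loss are illustrative assumptions only.

```python
# Hypothetical sketch of an image-text joint embedding module (not the GVRT code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointEmbedder(nn.Module):
    """Projects visual and textual features into a shared embedding space."""

    def __init__(self, visual_dim: int = 2048, text_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)  # e.g., pooled CNN features
        self.text_proj = nn.Linear(text_dim, embed_dim)      # e.g., sentence-encoder features

    def forward(self, visual_feat: torch.Tensor, text_feat: torch.Tensor):
        # L2-normalize so that alignment reduces to cosine similarity.
        v = F.normalize(self.visual_proj(visual_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v, t


def alignment_loss(v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Pulls each image embedding toward its paired text embedding."""
    return (1.0 - (v * t).sum(dim=-1)).mean()


if __name__ == "__main__":
    embedder = JointEmbedder()
    visual = torch.randn(8, 2048)  # batch of image features
    text = torch.randn(8, 768)     # batch of text features
    v, t = embedder(visual, text)
    print(alignment_loss(v, t))
```

In this kind of setup, grounding class-discriminative textual information into the visual representation amounts to minimizing the alignment loss jointly with the usual classification objective; the exact losses and encoders used by GVRT are described in the paper and repository linked above.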
