Paper Title

Improving Cross-Modal Retrieval with Set of Diverse Embeddings

Authors

Dongwon Kim, Namyup Kim, Suha Kwak

Abstract

Cross-modal retrieval across the image and text modalities is a challenging task due to its inherent ambiguity: an image often depicts multiple situations, and a caption can be coupled with diverse images. Set-based embedding has been studied as a solution to this problem: it encodes a sample into a set of embedding vectors, each capturing a different semantic of the sample. In this paper, we present a novel set-based embedding method that differs from previous work in two aspects. First, we present a new similarity function, called smooth-Chamfer similarity, designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module, built on the slot attention mechanism, that produces a set of embedding vectors effectively capturing the diverse semantics of the input. Our method is evaluated on the COCO and Flickr30K datasets across different visual backbones, where it outperforms existing methods, including ones that demand substantially more computation at inference.
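The abstract names the smooth-Chamfer similarity but does not define it. A natural reading is a Chamfer-style set similarity in which the hard max over the other set is replaced by a temperature-scaled LogSumExp, so that every embedding in each set contributes to the score and receives a gradient. The PyTorch sketch below illustrates that idea only; the function name, the cosine-similarity kernel, and the scale `alpha` are illustrative assumptions, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def smooth_chamfer_similarity(s1: torch.Tensor, s2: torch.Tensor,
                              alpha: float = 16.0) -> torch.Tensor:
    """Smoothed Chamfer-style similarity between two embedding sets.

    s1: (n, d) embedding set for one sample (e.g., an image).
    s2: (m, d) embedding set for the other sample (e.g., a caption).
    alpha: temperature; larger values approach the hard-max Chamfer similarity.
    """
    # Pairwise cosine similarities between the two sets: shape (n, m).
    c = F.normalize(s1, dim=-1) @ F.normalize(s2, dim=-1).T
    # Replace the max over the other set with LogSumExp, so every
    # element contributes to the score instead of only the best match.
    s1_to_s2 = torch.logsumexp(alpha * c, dim=1).mean() / alpha
    s2_to_s1 = torch.logsumexp(alpha * c, dim=0).mean() / alpha
    # Symmetrize by averaging both matching directions.
    return 0.5 * (s1_to_s2 + s2_to_s1)

# Toy usage: two sets of 4 embeddings in a 512-dim space.
img_set = torch.randn(4, 512)
txt_set = torch.randn(4, 512)
print(smooth_chamfer_similarity(img_set, txt_set))
```

As `alpha` grows, the LogSumExp approaches a hard max and the score reduces to the standard Chamfer similarity; smaller `alpha` spreads credit across all pairs, which is consistent with the abstract's stated goal of smoothing the behavior of existing set similarity functions.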
