Paper Title
Modality-Agnostic Attention Fusion for visual search with text feedback
Paper Authors
Paper Abstract
Image retrieval with natural language feedback offers the promise of catalog search based on fine-grained visual features that go beyond objects and binary attributes, facilitating real-world applications such as e-commerce. Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two datasets for visual search with modifying phrases, Fashion IQ and CSS, and performs competitively on Fashion200k, a dataset with only single-word modifications. We also introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs, and we show that our approach, without modification, outperforms strong baselines. To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of a surprising phenomenon: words avoid "attending" to the image regions they refer to.
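The core idea of modality-agnostic fusion is to treat image and text features as a single sequence of tokens and let one attention mechanism operate over all of them uniformly. The sketch below is a hypothetical, minimal illustration of that idea in numpy (random, unlearned weights; the function name `maaf_fuse` and the single-layer, mean-pooled design are assumptions for illustration, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def maaf_fuse(image_tokens, text_tokens, rng=None):
    """Illustrative modality-agnostic fusion: concatenate image and text
    tokens into one sequence, apply a single self-attention layer that
    treats every token identically regardless of modality, then pool
    into one query embedding. Weights are random here; a trained model
    would learn Wq, Wk, Wv (and would stack several such layers)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.concatenate([image_tokens, text_tokens], axis=0)  # (N_img + N_txt, d)
    d = x.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Every token attends over all tokens, image and text alike.
    attn = softmax(q @ k.T / np.sqrt(d))
    fused = attn @ v
    return fused.mean(axis=0)  # pooled embedding for retrieval
```

In a retrieval setting, the pooled embedding of the (reference image, modifying phrase) pair would be compared against embeddings of candidate catalog images, e.g. by cosine similarity.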