Paper Title
DetailCLIP: Injecting Image Details into CLIP's Feature Space
Paper Authors
Paper Abstract
Although CLIP-like vision-language models provide a functional joint feature space for images and text, subtle details are lost in the feature representation when a high-resolution image (e.g., 2240) is fed in, owing to the limited image input size (e.g., 224) of CLIP-like models. In this work, we introduce an efficient framework that produces a single feature representation for a high-resolution image, injecting image details while sharing the same semantic space as the original CLIP. Within the framework, we train a feature-fusion model on CLIP features extracted with a carefully designed image-patching scheme that can cover objects of any scale, weakly supervised by image-agnostic class-prompted queries. We validate the framework by retrieving images from class-prompted queries on real-world and synthetic datasets, showing significant performance improvements on these tasks. Furthermore, to fully demonstrate the framework's detail-retrieval ability, we construct a CLEVR-like synthetic dataset, CLEVR-DS, which is fully annotated and has a controllable object scale.
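The abstract only sketches the pipeline at a high level. Below is a minimal illustrative sketch of the general idea — encoding multi-scale crops of a high-resolution image with a frozen CLIP encoder and fusing the per-patch embeddings into a single vector that stays in CLIP's semantic space. This is not the authors' implementation: the grid scales, the attention-based fusion module, the prompt template, and helper names such as `multi_scale_patches` and `FeatureFusion` are assumptions made for illustration.

```python
# Illustrative sketch only (assumed design, not the paper's exact method):
# encode multi-scale crops of a high-resolution image with frozen CLIP and
# fuse the per-patch embeddings into one vector in CLIP's feature space.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)  # frozen encoder

def multi_scale_patches(image, scales=(1, 2, 4)):
    """Split a PIL image into grids at several scales so that objects of very
    different sizes each fill most of at least one 224x224 CLIP input.
    (The scales here are an assumption, not the paper's patching scheme.)"""
    w, h = image.size
    crops = []
    for s in scales:
        pw, ph = w // s, h // s
        for i in range(s):
            for j in range(s):
                crops.append(image.crop((i * pw, j * ph, (i + 1) * pw, (j + 1) * ph)))
    return crops

class FeatureFusion(nn.Module):
    """Fuse per-patch CLIP embeddings into a single vector with a learned
    query attending over the patches (an assumed fusion module)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, patch_feats):                      # (B, N, dim)
        q = self.query.expand(patch_feats.size(0), -1, -1)
        fused, _ = self.attn(q, patch_feats, patch_feats)
        return nn.functional.normalize(fused.squeeze(1), dim=-1)  # (B, dim)

@torch.no_grad()
def encode_patches(image):
    """Run every crop through frozen CLIP and return normalized patch features."""
    batch = torch.stack([preprocess(c) for c in multi_scale_patches(image)]).to(device)
    feats = clip_model.encode_image(batch).float()       # (N, 512)
    return nn.functional.normalize(feats, dim=-1).unsqueeze(0)    # (1, N, 512)

# Example (illustrative): compare the fused feature with CLIP text embeddings of
# class prompts, which is how a class-prompted retrieval query could be scored.
fusion = FeatureFusion().to(device)
# patch_feats = encode_patches(pil_image)               # (1, N, 512)
# fused = fusion(patch_feats)                           # (1, 512)
# text = clip.tokenize(["a photo of a cat"]).to(device)
# text_feat = nn.functional.normalize(clip_model.encode_text(text).float(), dim=-1)
# similarity = fused @ text_feat.T                      # cosine similarity for retrieval
```

Cropping at several grid scales is one simple way to ensure that both large and small objects appear near full size in some 224-pixel input; the paper's actual patching scheme and its weakly supervised training objective may differ from this sketch.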