Paper Title

Unleashing the Power of Visual Prompting At the Pixel Level

Paper Authors

Junyang Wu, Xianhang Li, Chen Wei, Huiyu Wang, Alan Yuille, Yuyin Zhou, Cihang Xie

Paper Abstract

This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our method includes two key designs. First, rather than directly adding together the prompt and the image, we treat the prompt as an extra and independent learnable component. We show that the strategy of reconciling the prompt and the image matters, and find that warping the prompt around a properly shrunk image empirically works the best. Second, we re-introduce two "old tricks" commonly used in building transferable adversarial examples, i.e., input diversity and gradient normalization, into visual prompting. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method sets a new record of 82.8% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.6%. It is worth noting that this prompting performance already outperforms linear probing by +2.1% and can even match full fine-tuning on certain datasets. In addition, our prompting method shows competitive performance across different data scales and against distribution shifts. The code is publicly available at https://github.com/UCSC-VLAA/EVP.
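The abstract describes two ingredients: a learnable prompt placed around a shrunken image (rather than added onto it), and two tricks borrowed from transferable adversarial attacks, input diversity and gradient normalization, used while optimizing the prompt. Below is a minimal PyTorch-style sketch of those ideas, not the authors' implementation (see the linked repository for that). Names such as `PixelPrompt`, `input_diversity`, and `prompted_training_step`, as well as the specific sizes and hyperparameters, are illustrative assumptions.

```python
# Minimal, hypothetical sketch of prompt-around-a-shrunken-image plus
# input diversity and gradient normalization. Not the official EVP code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelPrompt(nn.Module):
    """Learnable pixel prompt padded around a shrunken input image."""

    def __init__(self, image_size=224, prompt_size=30):
        super().__init__()
        self.image_size = image_size
        self.inner_size = image_size - 2 * prompt_size
        # The prompt is an independent learnable component covering the border.
        self.prompt = nn.Parameter(torch.zeros(1, 3, image_size, image_size))
        mask = torch.ones(1, 1, image_size, image_size)
        mask[:, :, prompt_size:-prompt_size, prompt_size:-prompt_size] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Shrink the image, place it in the center, and fill the border with the prompt.
        x = F.interpolate(x, size=self.inner_size, mode="bilinear", align_corners=False)
        pad = (self.image_size - self.inner_size) // 2
        x = F.pad(x, (pad, pad, pad, pad), value=0.0)
        return x * (1 - self.mask) + self.prompt * self.mask


def input_diversity(x, low=200, high=224, diversity_prob=0.5):
    """Random resize-and-pad augmentation, in the spirit of transferable-attack input diversity."""
    if torch.rand(1).item() > diversity_prob:
        return x
    size = int(torch.randint(low, high + 1, (1,)).item())
    resized = F.interpolate(x, size=size, mode="bilinear", align_corners=False)
    pad_total = high - size
    left = int(torch.randint(0, pad_total + 1, (1,)).item())
    top = int(torch.randint(0, pad_total + 1, (1,)).item())
    return F.pad(resized, (left, pad_total - left, top, pad_total - top), value=0.0)


def prompted_training_step(model, prompt_module, images, labels, lr=1.0):
    """One update of the prompt; the backbone is assumed frozen (requires_grad_(False))."""
    logits = model(input_diversity(prompt_module(images)))
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    with torch.no_grad():
        g = prompt_module.prompt.grad
        # Normalize the gradient before the step, mirroring the second "old trick".
        prompt_module.prompt -= lr * g / (g.norm() + 1e-12)
        prompt_module.prompt.grad = None
    return loss.item()
```

As a usage sketch, one would wrap a frozen classifier (e.g., a CLIP image encoder with a fixed text-derived classification head) as `model`, instantiate `PixelPrompt()`, and call `prompted_training_step` over mini-batches; only the prompt pixels are updated.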
