Paper Title

Generating Natural Language Attacks in a Hard Label Black Box Setting

Authors

Rishabh Maheshwary, Saket Maheshwary, Vikram Pudi

Abstract

We study the important and challenging task of attacking natural language processing models in a hard label black box setting. We propose a decision-based attack strategy that crafts high-quality adversarial examples on text classification and entailment tasks. Our attack strategy leverages a population-based optimization algorithm to craft plausible and semantically similar adversarial examples by observing only the top label predicted by the target model. At each iteration, the optimization procedure allows word replacements that maximize the overall semantic similarity between the original and the adversarial text. Further, our approach does not rely on substitute models or any kind of training data. We demonstrate the efficacy of our approach through extensive experiments and ablation studies on five state-of-the-art target models across seven benchmark datasets. Compared to attacks proposed in prior literature, we achieve a higher success rate with a lower word perturbation percentage, even in this highly restricted setting.
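To make the described procedure concrete, below is a minimal, runnable sketch of a decision-based, population-style word-substitution attack. It is an illustration under toy assumptions, not the authors' algorithm: `target_model`, `get_synonyms`, and `similarity` are hypothetical stand-ins (a real attack would draw substitutes from a synonym resource such as counter-fitted word embeddings and score candidates with a sentence encoder), but the loop respects the hard-label constraint by querying only the top predicted label.

```python
import random

# Toy stand-ins for the components the abstract assumes. All names here
# (target_model, get_synonyms, similarity) are hypothetical illustrations,
# not the paper's implementation.
SYNONYMS = {"terrible": ["awful", "dreadful"], "boring": ["dull", "tedious"]}

def target_model(words):
    """Hard-label black box: exposes only the top predicted label."""
    return 0 if ("terrible" in words or "boring" in words) else 1

def get_synonyms(word):
    return SYNONYMS.get(word, [])

def similarity(cand, orig):
    """Crude proxy for semantic similarity: fraction of unchanged words.
    A real attack would score candidates with a sentence encoder instead."""
    return sum(c == o for c, o in zip(cand, orig)) / len(orig)

def random_adversary(orig, label, tries=200):
    """Initialization: sample random substitutions until the top label flips."""
    for _ in range(tries):
        cand = [random.choice([w] + get_synonyms(w)) for w in orig]
        if target_model(cand) != label:
            return cand
    return None

def attack(text, pop_size=6, iterations=30):
    orig = text.split()
    label = target_model(orig)
    # Population of candidates, each of which already flips the label.
    pop = [c for c in (random_adversary(orig, label) for _ in range(pop_size)) if c]
    if not pop:
        return None
    for _ in range(iterations):
        # Crossover: splice two parents at a random cut point.
        p1, p2 = random.sample(pop, 2) if len(pop) > 1 else (pop[0], pop[0])
        cut = random.randrange(len(orig))
        child = p1[:cut] + p2[cut:]
        # Mutation toward the original text: restoring one original word
        # can only raise the similarity score.
        i = random.randrange(len(orig))
        child[i] = orig[i]
        # Keep the child only if it still fools the model.
        if target_model(child) != label:
            pop.append(child)
        # Selection: retain the candidates most similar to the original.
        pop.sort(key=lambda c: similarity(c, orig), reverse=True)
        pop = pop[:pop_size]
    return " ".join(pop[0])

print(attack("the movie was terrible and boring"))
```

Note the design choice this sketch shares with the abstract: every member of the population is required to remain adversarial, so the selection step is free to optimize semantic similarity alone, which is how word replacements at each iteration "maximize the overall semantic similarity" while keeping the attack successful.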
