Paper Title

Rethinking Textual Adversarial Defense for Pre-trained Language Models

Paper Authors

Jiayi Wang, Rongzhou Bao, Zhuosheng Zhang, Hai Zhao

Paper Abstract

Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that PrLMs are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations at different levels (sentence / word / character), adversarial attacks can fool PrLMs into producing incorrect predictions, which calls the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Based on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as a constraint that forces current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drops drastically, which reveals that the robustness of PrLMs is not as fragile as previously claimed. In addition, we find that four types of randomization can invalidate a large portion of textual adversarial examples. Based on the anomaly detector and randomization, we design a universal defense framework, which is among the first to perform textual adversarial defense without knowledge of the specific attack. Empirical results show that our universal defense framework achieves after-attack accuracy comparable to or even higher than that of attack-specific defenses, while preserving higher original accuracy. Our work discloses the essence of textual adversarial attacks and indicates that (1) future work on adversarial attacks should focus more on how to evade detection and resist randomization, otherwise the adversarial examples will be easily detected and invalidated; and (2) compared with unnatural and perceptible adversarial examples, it is the undetectable ones that pose real risks to PrLMs and deserve more attention in future robustness-enhancing strategies.
