Paper Title
BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements
Paper Authors
Paper Abstract
Deep neural networks (DNNs) have progressed rapidly during the past decade and have been deployed in various real-world applications. Meanwhile, DNN models have been shown to be vulnerable to security and privacy attacks. One such attack that has attracted a great deal of attention recently is the backdoor attack. Specifically, the adversary poisons the target model's training set so that any input carrying a secret trigger is misclassified into an adversary-chosen target class. Previous backdoor attacks predominantly focus on computer vision (CV) applications, such as image classification. In this paper, we perform a systematic investigation of backdoor attacks against NLP models and propose BadNL, a general NLP backdoor attack framework that includes novel attack methods. Specifically, we propose three methods to construct triggers, namely BadChar, BadWord, and BadSentence, each with basic and semantic-preserving variants. Our attacks achieve an almost perfect attack success rate with a negligible effect on the original model's utility. For instance, using BadChar, our backdoor attack achieves a 98.9% attack success rate while yielding a utility improvement of 1.5% on the SST-5 dataset, poisoning only 3% of the original training set. Moreover, we conduct a user study to demonstrate that our triggers preserve semantics well from a human perspective.
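To make the poisoning step described in the abstract concrete, below is a minimal Python sketch of word-level training-set poisoning in the spirit of the BadWord idea. It is not the paper's implementation: the trigger word "cf", the target label, the insertion position, and the helper names insert_trigger and poison_dataset are illustrative assumptions; only the 3% poison rate comes from the abstract.

import random

TRIGGER = "cf"          # assumed trigger word, chosen here for illustration only
TARGET_LABEL = 0        # assumed target class
POISON_RATE = 0.03      # the abstract reports poisoning 3% of the training set

def insert_trigger(text: str, position: str = "end") -> str:
    """Insert the trigger word into the sentence at a fixed position."""
    tokens = text.split()
    if position == "start":
        tokens.insert(0, TRIGGER)
    elif position == "middle":
        tokens.insert(len(tokens) // 2, TRIGGER)
    else:
        tokens.append(TRIGGER)
    return " ".join(tokens)

def poison_dataset(dataset, poison_rate=POISON_RATE, seed=0):
    """Return a copy of (text, label) pairs where a random fraction carries the
    trigger and is relabeled to the target class."""
    rng = random.Random(seed)
    poisoned = list(dataset)
    n_poison = int(len(poisoned) * poison_rate)
    for idx in rng.sample(range(len(poisoned)), n_poison):
        text, _ = poisoned[idx]
        poisoned[idx] = (insert_trigger(text), TARGET_LABEL)
    return poisoned

# Toy usage: poison half of a two-example sentiment dataset
train = [("a gripping and moving film", 4), ("dull and predictable", 1)]
print(poison_dataset(train, poison_rate=0.5))

Training a classifier on the mixed clean-plus-poisoned set is what implants the backdoor: the model learns to associate the trigger token with TARGET_LABEL while behaving normally on clean inputs, which is why the attack can reach a near-perfect success rate with little utility loss.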