Paper Title
Narcissus: A Practical Clean-Label Backdoor Attack with Limited Information
Paper Authors
Paper Abstract
Backdoor attacks insert malicious data into a training set so that, during inference, the model misclassifies inputs that have been patched with a backdoor trigger into an attacker-specified label. For a backdoor attack to bypass human inspection, it is essential that the injected data appear to be correctly labeled. Attacks with this property are often referred to as "clean-label attacks." Existing clean-label backdoor attacks require knowledge of the entire training set to be effective. Obtaining such knowledge is difficult or impossible because training data are often gathered from multiple sources (e.g., face images from different users), so it remains an open question whether clean-label backdoor attacks still present a real threat. This paper provides an affirmative answer by designing an algorithm that mounts a clean-label backdoor attack based only on knowledge of representative examples from the target class. By poisoning no more than 0.5% of the target-class data (and 0.05% of the full training set), we can train a model to classify test examples from arbitrary classes into the target class whenever those examples are patched with the backdoor trigger. Our attack works well across datasets and models, even when the trigger is presented in the physical world. We explore the space of defenses and find that, surprisingly, our attack in its vanilla form can evade the latest state-of-the-art defenses, or, after a simple twist, can be adapted to downstream defenses. We study the cause of this intriguing effectiveness and find that, because the trigger synthesized by our attack contains features as persistent as the original semantic features of the target class, any attempt to remove the trigger inevitably hurts model accuracy first.
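To make the threat model concrete, the following is a minimal sketch (not the paper's trigger-synthesis algorithm) of how the clean-label poisoning and test-time patching described above could be wired together. The function names, the NumPy-based interface, the `amplify` parameter, and the default budget value are illustrative assumptions, not details taken from the paper; the trigger is assumed to have been synthesized beforehand from representative target-class examples.

```python
# Minimal sketch of the clean-label poisoning protocol described in the
# abstract (NOT the paper's trigger-synthesis algorithm; `trigger` is
# assumed to have been synthesized already from target-class examples).
import numpy as np

def poison_target_class(train_x, train_y, trigger, target_class, ratio=0.005):
    """Add the trigger to a small fraction of *target-class* images while
    leaving their (correct) labels untouched, so the poisoned examples
    still appear cleanly labeled to a human inspector.
    Assumes images are arrays scaled to [0, 1] and labels are a NumPy array."""
    target_idx = np.flatnonzero(train_y == target_class)
    n_poison = max(1, round(ratio * len(target_idx)))   # e.g. 0.5% of the class
    chosen = np.random.choice(target_idx, n_poison, replace=False)
    poisoned_x = train_x.astype(np.float32).copy()
    poisoned_x[chosen] = np.clip(poisoned_x[chosen] + trigger, 0.0, 1.0)
    return poisoned_x, train_y, chosen                  # labels are never changed

def patch_at_test_time(x, trigger, amplify=1.0):
    """Patch an arbitrary-class test input with the (optionally amplified)
    trigger; a successfully backdoored model should then predict the target class."""
    return np.clip(x + amplify * trigger, 0.0, 1.0)
```

As a worked example of the stated budget, with CIFAR-10-style sizes assumed for illustration (5,000 images per class, 50,000 images in total), a 0.5% class budget amounts to 25 poisoned images, which is exactly 0.05% of the full training set.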