Paper Title

Towards Backdoor Attacks and Defense in Robust Machine Learning Models

Authors

Ezekiel Soremekun, Sakshi Udeshi, Sudipta Chattopadhyay

Abstract

The introduction of robust optimisation has pushed the state of the art in defending against adversarial attacks. Notably, the state-of-the-art projected gradient descent (PGD)-based training method has been shown to be universally and reliably effective in defending against adversarial inputs. This robustness approach uses PGD as a reliable and universal "first-order adversary". However, the behaviour of such optimisation has not been studied in the light of a fundamentally different class of attacks called backdoors. In this paper, we study how to inject and defend against backdoor attacks for robust models trained using PGD-based robust optimisation. We demonstrate that these models are susceptible to backdoor attacks. Subsequently, we observe that backdoors are reflected in the feature representations of such models. This observation is then leveraged to detect backdoor-infected models via a detection technique called AEGIS. Specifically, given a robust Deep Neural Network (DNN) trained using a PGD-based first-order adversarial training approach, AEGIS uses feature clustering to effectively detect whether such a DNN is backdoor-infected or clean. In our evaluation of several visible and hidden backdoor triggers on major classification tasks using the CIFAR-10, MNIST and FMNIST datasets, AEGIS effectively detects PGD-trained robust DNNs infected with backdoors. AEGIS detects such backdoor-infected models with 91.6% accuracy (11 out of 12 tested models) without any false positives. Furthermore, AEGIS detects the targeted class in a backdoor-infected model with a reasonably low (11.1%) false positive rate. Our investigation reveals that the salient features of adversarially robust DNNs could be promising for breaking the stealthy nature of backdoor attacks.
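The abstract rests on two technical ingredients: PGD-based adversarial training (the "first-order adversary") and clustering over a model's feature representations. For readers unfamiliar with the former, here is a minimal PyTorch sketch of one PGD adversarial-training step; the hyperparameters (eps=8/255, alpha=2/255, 7 attack steps) are common CIFAR-10 defaults assumed for illustration, not the paper's reported configuration.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """PGD, the 'first-order adversary': start at a random point in the
    L-inf ball of radius eps around x, then take signed-gradient ascent
    steps on the loss, projecting back into the ball after each step."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()  # ascend the loss
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def robust_training_step(model, optimiser, x, y):
    """One adversarial-training step: fit the model on the worst-case
    inputs found by the inner PGD maximisation."""
    model.eval()  # keep batch-norm statistics fixed while attacking
    x_adv = pgd_attack(model, x, y)
    model.train()
    optimiser.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimiser.step()
    return loss.item()
```

The detection side is described in the abstract only at the level of "feature clustering", so the following is a hypothetical illustration of that idea rather than AEGIS itself: extract penultimate-layer features for the inputs assigned to each class, cluster them into two groups, and flag classes whose features separate unusually cleanly, since a backdoored target class may contain a distinct mixed-in subpopulation. The `penultimate_features` helper, the silhouette threshold, and the `per_class_inputs` structure are all assumptions made for this sketch.

```python
import torch
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

@torch.no_grad()
def penultimate_features(model, inputs):
    """Hypothetical helper: activations of the layer before the final
    classifier. Real code would register a forward hook on that layer."""
    return model.features(inputs).flatten(1).cpu().numpy()

def class_splits_in_two(feats, threshold=0.5):
    """Cluster one class's features into k=2 groups. A high silhouette
    score means the class separates into two tight subpopulations,
    which may indicate a clean cluster plus a backdoor cluster."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
    return silhouette_score(feats, labels) > threshold

def scan_model(model, per_class_inputs):
    """per_class_inputs: dict of class id -> tensor of inputs the model
    assigns to that class. Returns the classes flagged as suspicious."""
    return [cls for cls, x in per_class_inputs.items()
            if class_splits_in_two(penultimate_features(model, x))]
```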
