Paper Title
Beating Attackers At Their Own Games: Adversarial Example Detection Using Adversarial Gradient Directions
Paper Authors
Paper Abstract
Adversarial examples are input examples that are specifically crafted to deceive machine learning classifiers. State-of-the-art adversarial example detection methods characterize an input example as adversarial either by quantifying the magnitude of feature variations under multiple perturbations or by measuring its distance from an estimated distribution of benign examples. Instead of using such metrics, the proposed method is based on the observation that the directions of adversarial gradients, computed when crafting (new) adversarial examples, play a key role in characterizing the adversarial space. Compared to detection methods that use multiple perturbations, the proposed method is efficient because it applies only a single random perturbation to the input example. Experiments conducted on two different databases, CIFAR-10 and ImageNet, show that the proposed detection method achieves 97.9% and 98.6% AUC-ROC (on average), respectively, across five different adversarial attacks, and outperforms multiple state-of-the-art detection methods. These results demonstrate the effectiveness of using adversarial gradient directions for adversarial example detection.
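To make the abstract's idea concrete, below is a minimal sketch of one way to turn the observation into a detector: compute the adversarial gradient (the direction an attacker would follow to craft a new adversarial example) at the input and at a single randomly perturbed copy, then compare the two directions. This is an illustrative approximation, not the authors' exact pipeline; the PyTorch framing, the `model` classifier, the noise scale `sigma`, and the threshold `tau` are all assumptions not taken from the abstract.

```python
# Minimal sketch of the core idea, not the authors' exact pipeline.
# Assumptions (not from the abstract): a PyTorch classifier `model`,
# inputs in [0, 1], illustrative noise scale `sigma` and threshold `tau`.
import torch
import torch.nn.functional as F

def adversarial_gradient(model, x):
    """Gradient of the classification loss w.r.t. the input, taken at the
    model's own prediction -- the direction an attacker would follow when
    crafting a (new) adversarial example from x."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    (grad,) = torch.autograd.grad(loss, x)
    return grad

def detect(model, x, sigma=0.05, tau=0.9):
    """Score each input by how much its adversarial gradient direction
    changes under a single random perturbation. The decision rule below
    is one plausible convention; its sign and tau must be fit on held-out
    benign/adversarial data."""
    g = adversarial_gradient(model, x)
    x_pert = (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)
    g_pert = adversarial_gradient(model, x_pert)
    # Cosine similarity between the two gradient directions, per example.
    cos = F.cosine_similarity(g.flatten(1), g_pert.flatten(1), dim=1)
    return cos < tau  # flag inputs whose gradient direction is unstable
```

Note the efficiency property the abstract claims: because only one perturbed forward/backward pass is added, the cost is roughly twice a single gradient computation, rather than scaling with the number of perturbations as in multi-perturbation detectors.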