Paper Title

Careful What You Wish For: On the Extraction of Adversarially Trained Models

Paper Authors

Kacem Khaled, Gabriela Nicolescu, Felipe Gohring de Magalhães

Paper Abstract

Recent attacks on Machine Learning (ML) models, such as evasion attacks with adversarial examples and model stealing through extraction attacks, pose several security and privacy threats. Prior work proposes adversarial training to secure models against adversarial examples, which can evade a model's classification and deteriorate its performance. However, this protection technique affects the model's decision boundary and its prediction probabilities, and hence may raise model privacy risks. In fact, a malicious user with only query access to a model's prediction output can extract it and obtain a high-accuracy, high-fidelity surrogate model. To achieve greater extraction, these attacks leverage the victim model's prediction probabilities. Indeed, no previous work on extraction attacks takes into consideration changes made to the training process for security purposes. In this paper, we propose a framework to assess extraction attacks on adversarially trained models with vision datasets. To the best of our knowledge, our work is the first to perform such an evaluation. Through an extensive empirical study, we demonstrate that adversarially trained models are more vulnerable to extraction attacks than models obtained under natural training circumstances: they can achieve up to $\times 1.2$ higher accuracy and agreement with fewer than $\times 0.75$ of the queries. We additionally find that adversarial robustness is transferable through extraction attacks, i.e., Deep Neural Networks (DNNs) extracted from robust models show enhanced accuracy on adversarial examples compared to DNNs extracted from naturally trained (i.e., standard) models.
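
The defense discussed in the abstract is adversarial training: the model is fit on worst-case perturbed inputs rather than clean ones. As a hedged illustration only, the PyTorch sketch below shows a generic PGD-based adversarial training step (in the style of Madry et al.); the function names, the $\ell_\infty$ budget `eps`, step size `alpha`, and step count are illustrative assumptions, not the paper's configuration.

```python
# Generic PGD adversarial training sketch (illustrative hyperparameters,
# not the paper's setup). Inputs are assumed to be images in [0, 1].
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Random start inside the epsilon ball, then projected gradient ascent.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]   # gradient w.r.t. input only
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the epsilon ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_train_step(model, opt, x, y):
    # One training step on adversarial examples crafted on the fly.
    model.train()
    x_adv = pgd_attack(model, x, y)
    loss = F.cross_entropy(model(x_adv), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Adversarial training of this kind tends to smooth and reshape the decision boundary, which is exactly the property the paper argues can make the model's prediction probabilities more informative to an extraction adversary.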
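The attack setting evaluated is query-only model extraction: the adversary sends inputs to the victim, records the returned prediction probabilities, and distills them into a surrogate. The sketch below is a minimal, generic version of such a loop, not the authors' exact attack; `victim`, `surrogate`, and `query_loader` are hypothetical placeholders.

```python
# Minimal soft-label model extraction sketch (hypothetical names).
# `victim` is treated as a black box that returns prediction probabilities;
# `surrogate` is the attacker's student DNN trained to imitate it.
import torch
import torch.nn.functional as F

def extract(victim, surrogate, query_loader, epochs=10, lr=1e-3, device="cpu"):
    victim.eval()
    surrogate.to(device).train()
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    for _ in range(epochs):
        for x, _ in query_loader:              # true labels unused: black-box setting
            x = x.to(device)
            with torch.no_grad():              # query access only, no victim gradients
                soft_labels = F.softmax(victim(x), dim=1)
            logits = surrogate(x)
            # Match the surrogate's distribution to the victim's full
            # probability vector (knowledge-distillation-style objective).
            loss = F.kl_div(F.log_softmax(logits, dim=1), soft_labels,
                            reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return surrogate
```

Because the loss consumes the victim's entire probability vector rather than just the top label, any defense that changes those probabilities, such as adversarial training, directly changes how much signal each query leaks.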
