Paper Title

An Adversarial Approach for Explaining the Predictions of Deep Neural Networks

Paper Authors

Arash Rahnama, Andrew Tseng

Paper Abstract

Machine learning models have been successfully applied to a wide range of applications including computer vision, natural language processing, and speech recognition. A successful implementation of these models, however, usually relies on deep neural networks (DNNs), which are treated as opaque black-box systems due to their incomprehensible complexity and intricate internal mechanisms. In this work, we present a novel algorithm for explaining the predictions of a DNN using adversarial machine learning. Our approach identifies the relative importance of input features in relation to the predictions based on the behavior of an adversarial attack on the DNN. Our algorithm has the advantage of being fast, consistent, and easy to implement and interpret. We present a detailed analysis that demonstrates how the behavior of an adversarial attack, given a DNN and a task, stays consistent for any input test data point, proving the generality of our approach. Our analysis enables us to produce consistent and efficient explanations. We illustrate the effectiveness of our approach by conducting experiments using a variety of DNNs, tasks, and datasets. Finally, we compare our work with other well-known techniques in the current literature.
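The abstract describes ranking input features by how an adversarial attack behaves with respect to them. The following is a minimal sketch of that general idea only, not the authors' actual algorithm: it assumes an FGSM-style attack as a stand-in and treats the per-feature magnitude of the loss gradient (which drives the attack's perturbation) as an importance score. The function name `adversarial_saliency` and the toy model are hypothetical, introduced purely for illustration.

```python
# Sketch (assumed, not the paper's method): per-feature importance derived
# from the gradient that an FGSM-style adversarial attack would follow.
import torch
import torch.nn as nn

def adversarial_saliency(model: nn.Module, x: torch.Tensor,
                         y: torch.Tensor) -> torch.Tensor:
    """Score input features by how strongly an FGSM-style attack targets them.

    FGSM perturbs each feature by eps * sign(grad of the loss), so features
    with larger gradient magnitude dominate the attack's behavior; we use
    that magnitude as a (hypothetical) relevance proxy.
    """
    model.eval()
    x = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    importance = x.grad.detach().abs()
    # Normalize to [0, 1] so scores are comparable across inputs.
    return importance / (importance.amax() + 1e-12)

# Toy usage: a tiny classifier over 4-dimensional inputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
    x = torch.randn(2, 4)        # two test points
    y = torch.tensor([0, 2])     # their (assumed) labels
    print(adversarial_saliency(model, x, y))
```

In this reading, features the attack exploits first are the ones the model's prediction depends on most; the paper's actual procedure and choice of attack should be taken from the paper itself.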
