Paper Title
Improving Interpretability via Regularization of Neural Activation Sensitivity
Paper Authors
Paper Abstract
State-of-the-art deep neural networks (DNNs) are highly effective at tackling many real-world tasks. However, their wide adoption in mission-critical contexts is hampered by two major weaknesses: their susceptibility to adversarial attacks and their opaqueness. The former raises concerns about the security and generalization of DNNs in real-world conditions, whereas the latter impedes users' trust in their output. In this research, we (1) examine the effect of adversarial robustness on interpretability and (2) present a novel approach for improving the interpretability of DNNs that is based on regularization of neural activation sensitivity. We compare the interpretability of models trained using our method with that of standard models and of models trained using state-of-the-art adversarial robustness techniques. Our results show that adversarially robust models are more interpretable than standard models, and that models trained using our proposed method surpass even the adversarially robust models in terms of interpretability.
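
The abstract names the regularizer but does not spell out its form. As a rough, non-authoritative sketch of what "regularization of neural activation sensitivity" could look like in practice, the PyTorch snippet below penalizes the gradient of a hidden layer's activations with respect to the input; the network, the helper names (SmallNet, sensitivity_penalty), and the weight lam are illustrative assumptions, not the paper's published formulation.

# Minimal sketch. Assumption: "activation sensitivity" is modeled here as
# the input gradient of a hidden layer's activation norm; the paper's exact
# penalty may differ.
import torch
import torch.nn as nn

class SmallNet(nn.Module):  # hypothetical toy model for illustration
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
        self.head = nn.Linear(256, 10)

    def forward(self, x):
        h = self.features(x)        # hidden activations we regularize
        return self.head(h), h

def sensitivity_penalty(model, x):
    # Squared norm of the gradient of ||h(x)||^2 w.r.t. the input:
    # a proxy for how sensitive neural activations are to input perturbations.
    x = x.clone().requires_grad_(True)
    _, h = model(x)
    grad = torch.autograd.grad(h.pow(2).sum(), x, create_graph=True)[0]
    return grad.pow(2).sum(dim=1).mean()

model = SmallNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
lam = 0.1                            # regularization strength (hypothetical)

x = torch.randn(32, 784)             # dummy batch in place of real data
y = torch.randint(0, 10, (32,))

logits, _ = model(x)
loss = criterion(logits, y) + lam * sensitivity_penalty(model, x)
opt.zero_grad()
loss.backward()                      # double backward via create_graph=True
opt.step()

Because the penalty is built with create_graph=True, the sensitivity term itself is differentiable, so the optimizer trades classification accuracy against activation smoothness, which is the general mechanism the abstract describes.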