Title
Assessing the risk of re-identification arising from an attack on anonymised data
Authors
Abstract
Objective: The use of routinely acquired medical data for research purposes requires the protection of patient confidentiality via data anonymisation. The objective of this work is to calculate the risk of re-identification arising from a malicious attack on an anonymised dataset, as described below. Methods: We first present an analytical means of estimating the probability of re-identification of a single patient in a k-anonymised dataset of Electronic Health Record (EHR) data. Second, we generalise this solution to obtain the probability of multiple patients being re-identified. We provide synthetic validation via Monte Carlo simulations to illustrate the accuracy of the estimates obtained. Results: The proposed analytical framework for risk estimation provides re-identification probabilities that agree with those obtained by simulation in a number of scenarios. Our work is limited by conservative assumptions, which inflate the re-identification probability. Discussion: Our estimates show that the re-identification probability increases with the proportion of the dataset maliciously obtained and that it has an inverse relationship with the equivalence class size. Our recursive approach extends the applicability domain to the general case of a multi-patient re-identification attack in an arbitrary k-anonymisation scheme. Conclusion: We prescribe a systematic way to parametrise the k-anonymisation process based on a pre-determined re-identification probability. We observe that, when the re-identification probability is benchmarked against the size of the portion of the dataset maliciously obtained by the adversary, the reduction in re-identification risk gained by increasing k may not be worth the accompanying loss of data granularity.
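The abstract does not state the attack model or the closed-form estimate, so the following is only a toy sketch of how an analytical risk estimate can be checked against a Monte Carlo simulation. It is not the authors' method. It assumes a hypothetical model in which every equivalence class has exactly k records, the adversary obtains a uniformly random fraction of the records, and a target is re-identified only if their record was obtained and the adversary then picks it by a uniform guess within the class. The names `analytic_risk` and `simulate_risk` are illustrative.

```python
import random


def analytic_risk(leak_fraction: float, k: int) -> float:
    """Toy closed form (an assumption, not the paper's formula):
    P(record leaked) * P(correct uniform guess within a class of size k)."""
    return leak_fraction / k


def simulate_risk(n_records: int, k: int, leak_fraction: float,
                  trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same toy re-identification probability."""
    if n_records % k != 0:
        raise ValueError("n_records must be a multiple of k in this sketch")
    rng = random.Random(seed)
    n_leaked = round(leak_fraction * n_records)
    hits = 0
    for _ in range(trials):
        # Records maliciously obtained by the adversary in this trial.
        leaked = set(rng.sample(range(n_records), n_leaked))
        # The patient the adversary is trying to re-identify.
        target = rng.randrange(n_records)
        if target not in leaked:
            continue
        # Records i*k .. i*k + k - 1 form one equivalence class.
        class_start = (target // k) * k
        guess = rng.randrange(class_start, class_start + k)
        if guess == target:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (2, 5, 10):
        for fraction in (0.1, 0.3, 0.5):
            estimate = simulate_risk(100, k, fraction)
            print(k, fraction, analytic_risk(fraction, k), estimate)
```

Under these assumptions the toy risk grows linearly with the leaked fraction and falls as 1/k, which mirrors the qualitative behaviour reported in the Discussion. Doubling k halves the risk but also coarsens the data, which is the trade-off the Conclusion refers to.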