Paper Title
A Unified Framework for Quantifying Privacy Risk in Synthetic Data
Paper Authors
Paper Abstract
Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner: it reproduces the global statistical properties of the original data without disclosing sensitive information about any individual. In practice, as with other anonymization methods, privacy risks cannot be entirely eliminated; the residual privacy risks instead need to be assessed ex post. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, the three key indicators of factual anonymization according to the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, and to design privacy attacks that directly model the singling out and linkability risks. We demonstrate the effectiveness of our methods through an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results highlight that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability to linkability, indicating that one-to-one relationships between real and synthetic data records are not preserved. Finally, we demonstrate quantitatively that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in detecting privacy leaks and in computation speed. To contribute to a privacy-conscious usage of synthetic data, we open-source Anonymeter at https://github.com/statice/anonymeter.
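To make the linkability risk mentioned in the abstract concrete, the sketch below shows one illustrative nearest-neighbour linkability attack on numeric tabular data. This is my own minimal construction for intuition, not the paper's actual algorithm or the Anonymeter API: an attacker holding two disjoint column subsets of the original records uses the synthetic data as a bridge to re-link them, and the success rate is higher the more the synthetic records mirror individual real records.

```python
# Illustrative sketch (NOT the paper's exact attack): a naive
# nearest-neighbour linkability attack. The attacker knows the A-columns
# and the B-columns of the original records separately and tries to
# re-link them through the synthetic dataset.
import numpy as np

def linkability_attack(syn, orig_a, orig_b, cols_a, cols_b):
    """Fraction of individuals correctly re-linked via the synthetic data.

    orig_a and orig_b are row-aligned: row i in both belongs to the
    same individual. syn is the full synthetic table.
    """
    links = 0
    for i in range(len(orig_a)):
        # Synthetic record closest to the A-part of individual i.
        d_a = np.linalg.norm(syn[:, cols_a] - orig_a[i], axis=1)
        j = int(np.argmin(d_a))
        # Original B-part closest to that synthetic record's B-columns.
        d_b = np.linalg.norm(orig_b - syn[j, cols_b], axis=1)
        if int(np.argmin(d_b)) == i:
            links += 1  # both halves point to the same individual
    return links / len(orig_a)

rng = np.random.default_rng(0)
orig = rng.normal(size=(50, 4))
# "Leaky" synthetic data: original rows plus tiny noise (strong memorization).
syn_leaky = orig + rng.normal(scale=0.01, size=orig.shape)
# "Safe" synthetic data: fresh samples from the same marginal distribution.
syn_safe = rng.normal(size=(50, 4))

cols_a, cols_b = [0, 1], [2, 3]
rate_leaky = linkability_attack(syn_leaky, orig[:, cols_a], orig[:, cols_b], cols_a, cols_b)
rate_safe = linkability_attack(syn_safe, orig[:, cols_a], orig[:, cols_b], cols_a, cols_b)
```

Consistent with the abstract's finding, the attack succeeds almost always on the leaky synthetic data that preserves one-to-one relationships with real records, while on the independently sampled data it falls to roughly the random-guessing baseline.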