Paper Title
Diversifying Anonymized Data with Diversity Constraints
Paper Authors
Abstract
Recently introduced privacy legislation aims to restrict and control the amount of personal data that companies publish and share with third parties. Much of this real data is not only sensitive, requiring anonymization, but also contains characteristic details from a variety of individuals. This diversity is desirable in many applications, ranging from Web search to drug and product development. Unfortunately, data anonymization techniques have largely ignored diversity in their published results, which inadvertently propagates underlying bias into subsequent data analysis. We study the problem of finding a diverse anonymized data instance, where diversity is measured via a set of diversity constraints. We formalize diversity constraints and study their foundations, such as implication and satisfiability. We show that determining the existence of a diverse, anonymized instance can be done in PTIME, and we present a clustering-based algorithm. We conduct extensive experiments using real and synthetic data, showing the effectiveness of our techniques and improvements over existing baselines. Our work aligns with recent trends towards responsible data science by coupling diversity with privacy-preserving data publishing.
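To make the notion of a "diverse, anonymized instance" concrete, the following is a minimal sketch of a checker, not the paper's algorithm. It rests on two assumptions that the abstract does not state: that the privacy model is k-anonymity over a set of quasi-identifier attributes, and that a diversity constraint bounds how many published rows carry a given attribute value. The function names, the `(attribute, value, lower, upper)` tuple layout, and the sample data are all hypothetical.

```python
from collections import Counter

SUPPRESSED = "*"  # assumed marker for a suppressed (anonymized) cell


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


def satisfies(rows, constraint):
    """Check one assumed-form constraint: (attribute, value, lower, upper).

    The constraint holds when the number of rows whose `attribute`
    equals `value` lies in the closed interval [lower, upper].
    """
    attribute, value, lower, upper = constraint
    count = sum(1 for row in rows if row[attribute] == value)
    return lower <= count <= upper


def is_diverse_and_anonymous(rows, quasi_identifiers, k, constraints):
    """True if the instance is k-anonymous and meets every diversity constraint."""
    return is_k_anonymous(rows, quasi_identifiers, k) and all(
        satisfies(rows, c) for c in constraints
    )


# Hypothetical published instance: two groups of size 2 on (age, ethnicity).
rows = [
    {"age": "30-39", "ethnicity": "Asian", "diagnosis": "flu"},
    {"age": "30-39", "ethnicity": "Asian", "diagnosis": "asthma"},
    {"age": "40-49", "ethnicity": SUPPRESSED, "diagnosis": "flu"},
    {"age": "40-49", "ethnicity": SUPPRESSED, "diagnosis": "diabetes"},
]

# Require between 2 and 3 published rows with ethnicity "Asian".
constraints = [("ethnicity", "Asian", 2, 3)]
result = is_diverse_and_anonymous(rows, ["age", "ethnicity"], 2, constraints)
```

The example illustrates the tension the abstract describes: suppressing the `ethnicity` value in the second group achieves anonymity but removes those rows from the count, so a stricter lower bound (for example, at least 3 rows) would no longer be satisfied by this instance.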