Paper Title
Leakage of Dataset Properties in Multi-Party Machine Learning
Paper Authors
Abstract
Secure multi-party machine learning allows several parties to build a model on their pooled data to increase utility while not explicitly sharing data with each other. We show that such multi-party computation can cause leakage of global dataset properties between the parties even when parties obtain only black-box access to the final model. In particular, a "curious" party can infer the distribution of sensitive attributes in other parties' data with high accuracy. This raises concerns regarding the confidentiality of properties pertaining to the whole dataset as opposed to individual data records. We show that our attack can leak population-level properties in datasets of different types, including tabular, text, and graph data. To understand and measure the source of leakage, we consider several models of correlation between a sensitive attribute and the rest of the data. Using multiple machine learning models, we show that leakage occurs even if the sensitive attribute is not included in the training data and has a low correlation with other attributes or the target variable.
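To make the attack setting concrete, the sketch below illustrates the standard shadow-model recipe for black-box property inference on synthetic tabular data. It is not the paper's implementation: the data generator, the fraction threshold, the probe set, and the choice of logistic regression for both the target and meta-classifier are all illustrative assumptions. A "curious" party trains many shadow models on datasets whose sensitive-attribute fraction it controls, queries each through a fixed probe set, and fits a meta-classifier that maps black-box output vectors to the hidden property.

```python
# Hedged sketch of black-box property inference via shadow models.
# All names and parameters are illustrative, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_dataset(p_sensitive, n=500):
    """Synthetic data: a sensitive attribute s (fraction p_sensitive)
    influences the label but is NOT included as a training feature."""
    s = rng.random(n) < p_sensitive
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 1.5 * s + 0.5 * rng.normal(size=n) > 0).astype(int)
    return X, y

probe = rng.normal(size=(32, 5))  # fixed queries chosen by the attacker

def blackbox_signature(p):
    """Train a target model on data with property p; return only its
    black-box outputs (predicted probabilities) on the probe set."""
    X, y = make_dataset(p)
    model = LogisticRegression().fit(X, y)
    return model.predict_proba(probe)[:, 1]

# Attacker builds shadow models with known property values...
props = [0.1] * 30 + [0.6] * 30
feats = np.stack([blackbox_signature(p) for p in props])
labels = np.array([p > 0.3 for p in props], dtype=int)
meta = LogisticRegression(max_iter=1000).fit(feats, labels)

# ...then infers the property of an unseen "victim" model trained
# on other parties' data with a high sensitive-attribute fraction.
victim = blackbox_signature(0.6).reshape(1, -1)
print(meta.predict(victim)[0])
```

The key point matching the abstract: the sensitive attribute never appears in the feature matrix, yet its population-level fraction shifts the joint model's decision surface enough that probe-set outputs reveal it. In a real multi-party deployment the shadow datasets would come from the attacker's own data plus assumptions about the property's range.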