Title
FairGen: Fair Synthetic Data Generation
Authors
Abstract
With the rising adoption of machine learning across domains such as banking, pharmaceuticals, and ed-tech, it has become critically important to adopt responsible AI methods that ensure models do not unfairly discriminate against any group. Given the scarcity of clean training data, generative adversarial techniques are often preferred for producing synthetic data, with several state-of-the-art architectures readily available across domains ranging from unstructured data such as text and images to structured datasets modelling fraud detection and more. These techniques overcome several challenges, such as class imbalance, limited training data, and restricted access to data due to privacy concerns. Existing work on generating fair data either targets a specific GAN architecture or is very difficult to tune across GANs. In this paper, we propose a pipeline to generate fairer synthetic data independent of the GAN architecture. The proposed pipeline utilizes a pre-processing algorithm to identify and remove bias-inducing samples. In particular, we claim that while generating synthetic data, most GANs amplify the bias present in the training data, but by removing these bias-inducing samples, GANs essentially focus more on genuinely informative samples. Our experimental evaluation on two open-source datasets demonstrates that the proposed pipeline generates fairer data, along with improved performance in some cases.
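To make the pre-processing idea concrete, the sketch below shows one plausible heuristic for removing bias-inducing samples before GAN training: greedily drop training points whose removal most reduces the demographic-parity gap between a binary protected group's positive-outcome rates. This is an illustrative assumption, not the paper's actual algorithm; the function names, the choice of demographic parity as the bias measure, and the greedy search are all hypothetical.

```python
import numpy as np

def demographic_parity_gap(y, a):
    """Absolute difference in positive-outcome rate between the two
    protected groups (a == 0 vs a == 1); smaller means fairer."""
    return abs(y[a == 1].mean() - y[a == 0].mean())

def remove_bias_inducing_samples(X, y, a, max_removals=100):
    """Greedily drop samples whose removal shrinks the demographic-parity
    gap of the remaining data (hypothetical pre-processing heuristic;
    the paper's pipeline may differ).

    X: feature matrix, y: binary labels, a: binary protected attribute.
    Returns the filtered (X, y, a) to be used as GAN training data.
    """
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_removals):
        best_gap = demographic_parity_gap(y[keep], a[keep])
        best_i = None
        # O(n) scan per removal: try deleting each remaining sample.
        for i in np.flatnonzero(keep):
            keep[i] = False
            gap = demographic_parity_gap(y[keep], a[keep])
            keep[i] = True
            if gap < best_gap:
                best_i, best_gap = i, gap
        if best_i is None:  # no single removal helps; stop early
            break
        keep[best_i] = False
    return X[keep], y[keep], a[keep]
```

The filtered dataset would then be fed, unchanged, to any GAN of choice, which is what makes a pre-processing approach architecture-independent. The greedy leave-one-out scan is O(n²) and meant only to illustrate the idea; a practical implementation would use a cheaper scoring rule.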