Paper title
Smart Data driven Decision Trees Ensemble Methodology for Imbalanced Big Data
Paper authors
Paper abstract
Differences in the amount of data per class, known as imbalanced data distribution, have become a common problem affecting data quality. Big Data scenarios pose a new challenge to traditional imbalanced classification algorithms, which are not designed to work with such amounts of data. Split-data strategies and the lack of data in the minority class that result from using the MapReduce paradigm pose new challenges for tackling class imbalance in Big Data scenarios. Ensembles have been shown to address imbalanced data problems successfully. Smart Data refers to data of sufficient quality to achieve high-performance models. Combining ensembles with Smart Data, obtained through Big Data preprocessing, should therefore produce a strong synergy. In this paper, we propose a novel Smart Data driven Decision Trees Ensemble methodology for addressing the imbalanced classification problem in Big Data domains, namely the SD_DeTE methodology. The methodology learns different decision trees from distributed quality data for the ensemble process. This quality data is obtained by fusing Random Discretization, Principal Components Analysis and clustering-based Random Oversampling to produce different Smart Data versions of the original data. Experiments carried out on 21 binary adapted datasets show that our methodology outperforms Random Forest.
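The abstract names the building blocks of the methodology but not how they are wired together, so the following is only a minimal single-machine sketch of one plausible reading, not the authors' distributed implementation. It assumes NumPy and scikit-learn, binary labels encoded as 0 and 1, and several choices the abstract does not specify: thresholds for Random Discretization are sampled from the observed feature values, "fusing" is taken to mean concatenating the discretized features with their PCA projection, oversampling duplicates minority samples inside each k-means cluster until that cluster is balanced, and the trees are combined by majority vote. The names `random_discretize`, `apply_cuts`, `cluster_oversample` and `SmartDataTreeEnsemble` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier


def random_discretize(X, n_bins, rng):
    """Sample n_bins - 1 cut points per feature from that feature's own values."""
    return [np.sort(rng.choice(X[:, j], size=n_bins - 1, replace=False))
            for j in range(X.shape[1])]


def apply_cuts(X, cuts):
    """Map every value to the index of the interval it falls into."""
    return np.column_stack([np.searchsorted(c, X[:, j]) for j, c in enumerate(cuts)])


def cluster_oversample(X, y, n_clusters, rng):
    """Duplicate random minority samples inside each cluster until it is balanced.

    Assumption: clusters that contain no minority sample are left untouched.
    """
    seed = int(rng.integers(1 << 31))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    minority = np.argmin(np.bincount(y))
    selected = [np.arange(len(y))]          # keep every original sample
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        mino = idx[y[idx] == minority]
        deficit = len(idx) - 2 * len(mino)  # majority count minus minority count
        if len(mino) > 0 and deficit > 0:
            selected.append(rng.choice(mino, size=deficit, replace=True))
    sel = np.concatenate(selected)
    return X[sel], y[sel]


class SmartDataTreeEnsemble:
    """Each tree is trained on its own randomized, rebalanced view of the data."""

    def __init__(self, n_trees=10, n_bins=5, n_clusters=3, seed=0):
        self.n_trees, self.n_bins = n_trees, n_bins
        self.n_clusters, self.seed = n_clusters, seed

    @staticmethod
    def _transform(X, cuts, pca):
        D = apply_cuts(X, cuts)
        # Assumed fusion step: discretized features joined with their PCA projection.
        return np.hstack([D, pca.transform(D)])

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        self.members = []
        for _ in range(self.n_trees):
            cuts = random_discretize(X, self.n_bins, rng)
            pca = PCA(random_state=0).fit(apply_cuts(X, cuts))
            Z = self._transform(X, cuts, pca)
            Zb, yb = cluster_oversample(Z, y, self.n_clusters, rng)
            tree = DecisionTreeClassifier(random_state=0).fit(Zb, yb)
            self.members.append((cuts, pca, tree))
        return self

    def predict(self, X):
        votes = np.stack([t.predict(self._transform(X, c, p))
                          for c, p, t in self.members])
        return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote, ties go to 1
```

Calling `SmartDataTreeEnsemble(n_trees=5).fit(X, y).predict(X_new)` on a NumPy feature matrix and a 0/1 label vector exercises the whole pipeline. Because each tree draws new cut points, the ensemble members see different versions of the same data, which is the source of diversity in this sketch.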