Paper title
LiteMORT: A memory efficient gradient boosting tree system on adaptive compact distributions
Paper authors
Paper abstract
Gradient boosted decision trees (GBDT) are among the leading algorithms for many commercial and academic data applications. We give a deep analysis of this algorithm, especially of the histogram technique, which is the basis for the regularized distribution with compact support. We present three new modifications. 1) A shared-memory technique to reduce memory usage; in many cases, only the data source itself is needed and no extra memory is required. 2) Implicit merging for the "merge overflow" problem, in which merging several small datasets produces a dataset too large to be handled. With implicit merging, only the original small datasets are needed to train the GBDT model. 3) An adaptive bin-resizing algorithm for histograms to improve accuracy. Experiments on two large Kaggle competitions verify our methods: they use much less memory than LightGBM and achieve higher accuracy. We have implemented these algorithms in the open-source package LiteMORT. The source code is available at https://github.com/closest-git/LiteMORT
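The key property behind implicit merging can be illustrated with a minimal sketch (not the LiteMORT implementation itself): because histogram-based GBDT only needs per-bin counts, the histogram of a merged dataset equals the sum of the histograms of its parts, so the huge merged dataset never has to be materialized.

```python
import numpy as np

def feature_histogram(values, bin_edges):
    """Count how many feature values fall into each histogram bin."""
    return np.histogram(values, bins=bin_edges)[0]

# Two small datasets that would normally be merged into one large one.
part_a = np.array([0.1, 0.4, 0.35, 0.8])
part_b = np.array([0.05, 0.6, 0.9, 0.95, 0.5])

bin_edges = np.linspace(0.0, 1.0, num=5)  # 4 equal-width bins over [0, 1]

# Explicit merge: materialize the concatenated dataset, then bin it.
explicit = feature_histogram(np.concatenate([part_a, part_b]), bin_edges)

# Implicit merge: bin each small dataset separately and add the counts,
# without ever building the merged dataset in memory.
implicit = feature_histogram(part_a, bin_edges) + feature_histogram(part_b, bin_edges)

assert np.array_equal(explicit, implicit)
```

In a real GBDT trainer the same additivity holds for the per-bin gradient and hessian sums, which is what allows training directly from the original small datasets.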