Paper Title
Balance-Subsampled Stable Prediction
Paper Authors
Paper Abstract
In machine learning, it is commonly assumed that training and test data share the same population distribution. However, this assumption is often violated in practice because sample selection bias may induce a distribution shift from training data to test data. Such a model-agnostic distribution shift usually leads to prediction instability across unknown test data. In this paper, we propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design. It isolates the clear effect of each predictor from the confounding variables. A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift, and hence improves both the accuracy of parameter estimation and prediction stability. Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
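The abstract does not spell out the BSSP procedure, so the following is only a toy sketch of the general idea of subsampling to match a fractional factorial design, not the authors' algorithm. It assumes binary predictors coded as -1/+1, a two-level 2^(3-1) design with defining relation I = ABC, and a hypothetical helper named `balance_subsample`. The synthetic data and the 80% agreement rate between the first two predictors are also illustrative assumptions.

```python
import numpy as np

def balance_subsample(X, design, rng):
    """Hypothetical helper: pick row indices of X so that every run (row)
    of `design` is represented by the same number of samples."""
    # Indices of the samples whose predictor values equal each design run.
    groups = [np.flatnonzero((X == run).all(axis=1)) for run in design]
    m = min(len(g) for g in groups)
    if m == 0:
        raise ValueError("at least one design run has no matching sample")
    # Draw m samples without replacement from every group.
    picked = [rng.choice(g, size=m, replace=False) for g in groups]
    return np.concatenate(picked)

# 2^(3-1) fractional factorial design in -1/+1 coding.
# The third column is the product of the first two (I = ABC),
# so the three columns are balanced and mutually orthogonal.
design = np.array([[-1, -1,  1],
                   [-1,  1, -1],
                   [ 1, -1, -1],
                   [ 1,  1,  1]])

rng = np.random.default_rng(0)
n = 4000

# Assumed selection bias: x2 agrees with x1 about 80% of the time,
# which makes the two predictors correlated in the training sample.
x1 = rng.choice([-1, 1], size=n)
x2 = np.where(rng.random(n) < 0.8, x1, -x1)
x3 = rng.choice([-1, 1], size=n)
X = np.column_stack([x1, x2, x3])

idx = balance_subsample(X, design, rng)
Xs = X[idx]

corr_full = np.corrcoef(X, rowvar=False)   # predictors are correlated
corr_sub = np.corrcoef(Xs, rowvar=False)   # predictors are uncorrelated

print("x1-x2 correlation, full sample:", round(corr_full[0, 1], 3))
print("x1-x2 correlation, subsample:  ", round(corr_sub[0, 1], 3))
```

In this sketch the subsample contains each design run equally often, so the predictor columns inherit the orthogonality of the design and their pairwise correlations vanish. A regression fitted on `Xs` would then estimate each main effect without the predictor-to-predictor confounding present in the full biased sample, at the cost of discarding data and of aliasing main effects with two-factor interactions, as in any resolution III design.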