Interpretable random forest models through forward variable selection

Paper title

Interpretable random forest models through forward variable selection

Authors

Jasper Velthoen, Juan-Juan Cai, Geurt Jongbloed

Abstract

Random forest is a popular prediction approach for handling high-dimensional covariates. However, it is often infeasible to interpret the resulting high-dimensional, non-parametric model. To obtain an interpretable predictive model, we develop a forward variable selection method using the continuous ranked probability score (CRPS) as the loss function. Our stepwise procedure leads to the smallest set of variables that optimizes the CRPS risk, by performing at each step a hypothesis test for a significant decrease in CRPS risk. We provide mathematical motivation for our method by proving that, in the population sense, the method attains the optimal set. Additionally, we show that the test is consistent provided that the random forest estimator of a quantile function is consistent. In a simulation study, we compare the performance of our method with an existing variable selection method for different sample sizes and different correlation strengths of the covariates. Our method is observed to have a much lower false positive rate. We also demonstrate an application of our method to statistical post-processing of daily maximum temperature forecasts in the Netherlands. Our method selects about 10% of the covariates while retaining the same predictive power.
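The stepwise idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It makes two loud assumptions: the CRPS is computed from a finite sample of the predictive distribution using the standard identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, and the paper's hypothesis test is replaced by a fixed threshold `min_gain`. The names `crps_ensemble`, `forward_select`, `risk`, and `min_gain` are hypothetical; in practice `risk` would wrap a random forest fit and an estimate of the CRPS risk.

```python
def crps_ensemble(samples, y):
    """Empirical CRPS of a sample-based forecast for observation y.

    Uses the identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|,
    with both expectations replaced by averages over `samples`.
    """
    n = len(samples)
    term_obs = sum(abs(x - y) for x in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def forward_select(candidates, risk, min_gain):
    """Greedy forward selection driven by an estimated risk.

    candidates: list of variable names.
    risk: callable mapping a list of selected variables to an estimated
          CRPS risk (hypothetical; supplied by the caller).
    min_gain: assumed stand-in for the paper's hypothesis test. A variable
              is added only if it lowers the risk by more than this amount.
    """
    selected = []
    remaining = list(candidates)
    current = risk(selected)
    while remaining:
        # Evaluate each remaining variable when added to the current set.
        scored = [(risk(selected + [v]), v) for v in remaining]
        best_risk, best_var = min(scored, key=lambda t: t[0])
        if current - best_risk <= min_gain:
            break  # no candidate gives a large enough decrease; stop
        selected.append(best_var)
        remaining.remove(best_var)
        current = best_risk
    return selected
```

Replacing the fixed threshold with a proper test on the decrease in CRPS risk, as the abstract describes, is what controls the false positive rate; the threshold here only marks where that decision would be made.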
