Paper Title
Achieving Reliable Causal Inference with Data-Mined Variables: A Random Forest Approach to the Measurement Error Problem
Paper Authors
Abstract
Combining machine learning with econometric analysis is becoming increasingly prevalent in both research and practice. A common empirical strategy applies predictive modeling techniques to 'mine' variables of interest from available data and then includes those variables in an econometric framework, with the objective of estimating causal effects. Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error. We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest. Specifically, we employ the random forest not just for prediction, but also for generating instrumental variables to address the measurement error embedded in the prediction. The random forest algorithm performs best when it is composed of a set of trees that are individually accurate in their predictions, yet which also make 'different' mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are closely related to the relevance and exclusion requirements of valid instrumental variables. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees serve as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases and its superior performance over three alternative methods for bias correction.
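The intuition behind using one tree as the endogenous covariate and other trees as its instruments can be illustrated with a small simulation. The sketch below is not the paper's procedure: it does not train a random forest or perform the data-driven tree selection. Instead, it assumes an idealized setting in which each "tree" prediction equals the true variable plus an independent error, and it uses a simple instrumental-variable estimator with the averaged remaining trees as a single instrument rather than full two-stage least squares. All function names and parameter values (`simulate`, `ols_slope`, `iv_slope`, `beta`, `noise_sd`) are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(n=20000, beta=2.0, noise_sd=1.0, n_trees=3, seed=0):
    """Simulate an outcome and idealized 'tree' predictions.

    Assumption: each tree's prediction is the true variable plus an
    independent Gaussian error, i.e. uncorrelated prediction errors.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * x + rng.gauss(0, 1) for x in x_true]
    trees = [[x + rng.gauss(0, noise_sd) for x in x_true]
             for _ in range(n_trees)]
    return y, trees


def ols_slope(y, x):
    """Slope from a simple regression of y on x."""
    return cov(x, y) / cov(x, x)


def iv_slope(y, x_endog, instruments):
    """Simple IV estimate: cov(z, y) / cov(z, x_endog).

    The instrument z is the row-wise average of the other trees'
    predictions (a simplification, not full 2SLS).
    """
    z = [mean(row) for row in zip(*instruments)]
    return cov(z, y) / cov(z, x_endog)


if __name__ == "__main__":
    y, trees = simulate()
    # Regressing on one noisy prediction attenuates the slope toward zero.
    print("OLS on one tree:", ols_slope(y, trees[0]))
    # Instrumenting with the other trees should land close to beta.
    print("IV with other trees:", iv_slope(y, trees[0], trees[1:]))
```

Under these assumptions, the ordinary least squares slope is attenuated by roughly the factor 1 / (1 + noise_sd²), while the instrumented estimate should be close to the true `beta`. In real forests the trees' errors are only weakly correlated rather than independent, which is why the abstract describes a selection procedure for choosing which trees to pair.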