Fast Inference for Quantile Regression with Tens of Millions of Observations

Paper title

Fast Inference for Quantile Regression with Tens of Millions of Observations

Authors

Sokbae Lee, Yuan Liao, Myung Hwan Seo, Youngki Shin

Abstract

Big data analytics has opened new avenues in economic research, but analyzing datasets with tens of millions of observations remains a substantial challenge. Conventional econometric methods based on extremum estimators require large amounts of computing resources and memory, which are often not readily available. In this paper, we focus on linear quantile regression applied to "ultra-large" datasets, such as U.S. decennial censuses. We present a fast inference framework that uses stochastic subgradient descent (S-subGD) updates. The inference procedure handles cross-sectional data sequentially: (i) updating the parameter estimate with each incoming "new observation", (ii) aggregating the estimates as a $\textit{Polyak-Ruppert}$ average, and (iii) computing a pivotal statistic for inference using only the solution path. The methodology draws on time-series regression to create an asymptotically pivotal statistic through random scaling. Our proposed test statistic is computed in a fully online fashion, and critical values are obtained without resampling. We conduct extensive numerical studies to showcase the computational merits of our proposed inference. For inference problems as large as $(n, d) \sim (10^7, 10^3)$, where $n$ is the sample size and $d$ is the number of regressors, our method generates new insights and surpasses current inference methods in computation. In particular, our method reveals trends in the gender gap in the U.S. college wage premium using millions of observations, while controlling for more than $10^3$ covariates to mitigate confounding effects.
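The three steps in the abstract can be illustrated with a minimal single-pass sketch. It is not the authors' implementation. The abstract gives no formulas, so the following are all assumptions made for illustration: the function names (`s_subgd_quantile`, `simulate`), the learning-rate schedule $\gamma_i = \gamma_0 i^{-a}$, the simulated data, and the specific form of the random-scaling quantity, taken here as $V_n = n^{-2} \sum_{s=1}^{n} s^2 (\bar\beta_s - \bar\beta_n)^2$ for each coordinate. Only the diagonal of $V_n$ is tracked, which is enough for coordinate-wise t-type statistics.

```python
import math
import random


def s_subgd_quantile(data, d, tau=0.5, gamma0=1.0, a=0.501):
    """Single pass of S-subGD for linear quantile regression (illustrative).

    Returns the Polyak-Ruppert average, the diagonal of the assumed
    random-scaling quantity V_n, and the number of observations processed.
    """
    beta = [0.0] * d   # current S-subGD iterate
    bar = [0.0] * d    # running Polyak-Ruppert average of the iterates
    A = [0.0] * d      # running sum of s^2 * bar_s^2
    b = [0.0] * d      # running sum of s^2 * bar_s
    c = 0.0            # running sum of s^2
    n = 0
    for x, y in data:
        n += 1
        gamma = gamma0 * n ** (-a)  # assumed learning-rate schedule
        resid = y - sum(xj * bj for xj, bj in zip(x, beta))
        # A subgradient of the check loss in beta is -x * (tau - 1{resid < 0}).
        w = tau - (1.0 if resid < 0 else 0.0)
        beta = [bj + gamma * w * xj for xj, bj in zip(x, beta)]   # step (i)
        bar = [mj + (bj - mj) / n for mj, bj in zip(bar, beta)]   # step (ii)
        s2 = float(n) * n
        A = [Aj + s2 * mj * mj for Aj, mj in zip(A, bar)]
        b = [bj_ + s2 * mj for bj_, mj in zip(b, bar)]
        c += s2
    # Step (iii): sum_s s^2 (bar_s - bar_n)^2 expanded so that it can be
    # evaluated from running sums, without storing the solution path.
    V = [max(A[j] - 2.0 * b[j] * bar[j] + c * bar[j] ** 2, 0.0) / n ** 2
         for j in range(d)]
    return bar, V, n


def simulate(n, beta_true, seed=0):
    """Yield (x, y) pairs from a toy median-regression model (hypothetical)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [1.0, rng.gauss(0.0, 1.0)]
        y = sum(xj * bj for xj, bj in zip(x, beta_true)) + rng.gauss(0.0, 1.0)
        yield x, y


beta_true = [1.0, 2.0]
beta_bar, V, n = s_subgd_quantile(simulate(20000, beta_true), d=2)
# Coordinate-wise statistics for H0: beta_j = beta_true_j. Under random
# scaling their limiting distribution is non-normal, so normal critical
# values do not apply.
t_stats = [math.sqrt(n) * (bh - bt) / math.sqrt(v)
           for bh, bt, v in zip(beta_bar, beta_true, V)]
print(beta_bar, t_stats)
```

Because every quantity is a running sum, memory use is proportional to $d$ and does not grow with $n$, which is the property the abstract describes as "fully online".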
