Paper title
Floodgate: inference for model-free variable importance
Paper authors
Paper abstract
Many modern applications seek to understand the relationship between an outcome variable $Y$ and a covariate $X$ in the presence of a (possibly high-dimensional) confounding variable $Z$. Although much attention has been paid to testing \emph{whether} $Y$ depends on $X$ given $Z$, in this paper we seek to go beyond testing by inferring the \emph{strength} of that dependence. We first define our estimand, the minimum mean squared error (mMSE) gap, which quantifies the conditional relationship between $Y$ and $X$ in a way that is deterministic, model-free, interpretable, and sensitive to nonlinearities and interactions. We then propose a new inferential approach called \emph{floodgate} that can leverage any working regression function chosen by the user (e.g., one fitted by a state-of-the-art machine learning algorithm or derived from qualitative domain knowledge) to construct asymptotic confidence bounds, and we apply it to the mMSE gap. We additionally show that floodgate's accuracy (the distance from the confidence bound to the estimand) is adaptive to the error of the working regression function. We then show that the same floodgate principle can be applied to a different measure of variable importance when $Y$ is binary. Finally, we demonstrate floodgate's performance in a series of simulations and apply it to data from the UK Biobank to infer the strengths of dependence of platelet count on various groups of genetic mutations.
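For orientation, the mMSE gap named in the abstract can be written out as follows. This is a sketch that reads the name literally, as the difference between the smallest mean squared error attainable when predicting $Y$ from $Z$ alone and from $(X, Z)$ together. The symbol $\mathcal{I}$ is an assumed notation, not taken from the abstract, and $E[Y^2] < \infty$ is assumed.

\[
\mathcal{I}^2 \;=\; E\big[(Y - E[Y \mid Z])^2\big] \;-\; E\big[(Y - E[Y \mid X, Z])^2\big] \;=\; E\big[\mathrm{Var}\big(E[Y \mid X, Z] \mid Z\big)\big].
\]

The second equality follows from the law of total variance applied conditionally on $Z$. Under this reading, $\mathcal{I}^2 \ge 0$, with equality exactly when $E[Y \mid X, Z] = E[Y \mid Z]$ almost surely, that is, when $X$ does not improve the best mean-squared-error prediction of $Y$ once $Z$ is known.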