Paper Title
Knockoff Boosted Tree for Model-Free Variable Selection
Paper Authors
Paper Abstract
In this article, we propose a novel strategy for variable selection without prior knowledge of the model topology, using the knockoff method with boosted tree models. Our method is inspired by the original knockoff method, in which the differences between original and knockoff variables are used for variable selection with false discovery rate control. The original method uses the Lasso for regression models and assumes there are more samples than variables. We extend this method to both model-free and high-dimensional variable selection. We propose two new sampling methods for generating knockoffs, namely the sparse covariance and principal component knockoff methods. We test these methods and compare them with the original knockoff method in terms of type I error control and power. The boosted tree model is a complex system and has more hyperparameters than models with simpler assumptions. In our framework, these hyperparameters are either tuned through Bayesian optimization or fixed at multiple levels for trend detection. In simulation tests, we also compare the properties and performance of importance test statistics of tree models. The results include combinations of different knockoffs and importance test statistics. We also consider scenarios that include main-effect, interaction, exponential, and second-order models while assuming the true model structures are unknown. We apply our algorithm to tumor purity estimation and tumor classification using The Cancer Genome Atlas (TCGA) gene expression data. The proposed algorithm is included in the KOBT package, available at \url{https://cran.r-project.org/web/packages/KOBT/index.html}.
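The selection step that the abstract refers to (using differences between original and knockoff variables to control the false discovery rate) can be sketched as follows. This is a minimal illustration of the generic knockoff filter threshold, not the KOBT package's API. The function names `knockoff_threshold` and `knockoff_select`, the `offset` parameter, and the example statistics are assumptions introduced here. In the setting of the abstract, each statistic `w[j]` would be the importance of variable j minus the importance of its knockoff under a fitted boosted tree model.

```python
def knockoff_threshold(w, q=0.1, offset=1):
    """Return the data-dependent threshold for knockoff statistics w.

    w[j] > 0 suggests variable j matters more than its knockoff.
    offset=1 gives the conservative "knockoff+" threshold; offset=0 is
    the more liberal variant. Returns infinity if no threshold works.
    """
    # Candidate thresholds are the distinct nonzero magnitudes of w.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false discoveries
        positives = sum(1 for x in w if x >= t)   # variables that would be selected
        if (offset + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def knockoff_select(w, q=0.1, offset=1):
    """Return indices of variables selected at target FDR level q."""
    tau = knockoff_threshold(w, q, offset)
    return [j for j, x in enumerate(w) if x >= tau]


# Hypothetical statistics for ten variables (illustration only).
w_example = [5.0, 4.0, 3.5, 3.0, 2.5, -0.5, 0.3, -0.2, 0.1, 2.0]
print(knockoff_select(w_example, q=0.25))
```

With only ten variables, a strict target such as q = 0.1 cannot be met by the conservative threshold, because the estimated false discovery proportion is at least 1 divided by the number of selected variables. The sketch then returns an empty selection.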