Paper Title
Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints
Paper Authors
Paper Abstract
Identifying informative predictors in a high-dimensional regression model is a critical step for association analysis and predictive modeling. Signal detection in the high-dimensional setting often fails due to limited sample size. One approach to improving power is to meta-analyze multiple studies that address the same scientific question. However, integrative analysis of high-dimensional data from multiple studies is challenging in the presence of between-study heterogeneity. The challenge is even more pronounced under additional data sharing constraints, where only summary data, not individual-level data, can be shared across sites. In this paper, we propose a novel data shielding integrative large-scale testing (DSILT) approach to signal detection that allows for between-study heterogeneity and does not require sharing individual-level data. Assuming that the underlying high-dimensional regression models differ across studies yet share similar support, the DSILT approach incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling the false discovery rate (FDR) and false discovery proportion (FDP). Theoretical comparisons of the DSILT procedure with the ideal individual-level meta-analysis (ILMA) approach and other distributed inference methods are investigated. Simulation studies demonstrate that the DSILT procedure performs well in both false discovery control and power. The proposed method is applied to a real example on detecting interaction effects of genetic variants for statins and obesity on the risk of Type 2 Diabetes.
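As a rough illustration of the setting the abstract describes, the sketch below combines per-site summary statistics for each covariate and then applies a false discovery rate threshold. It is not the DSILT procedure: the abstract does not give the form of the test statistics, so this sketch substitutes Stouffer's z-score combination and the standard Benjamini-Hochberg step-up rule. The function names and the z-scores in `site_z` are hypothetical, and Stouffer's method assumes effects point in the same direction across sites, which is a stronger assumption than the heterogeneity the paper allows.

```python
import math


def stouffer_p_values(site_z):
    """Combine per-site z-scores into two-sided p-values, one per covariate.

    site_z[m][j] is the z-score for covariate j reported by site m.
    Only these summary values cross site boundaries (illustrative assumption).
    """
    n_sites = len(site_z)
    n_tests = len(site_z[0])
    p_values = []
    for j in range(n_tests):
        combined = sum(site[j] for site in site_z) / math.sqrt(n_sites)
        # Two-sided p-value under a standard normal null distribution.
        p_values.append(math.erfc(abs(combined) / math.sqrt(2)))
    return p_values


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its step-up threshold.
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical z-scores from three sites for five covariates.
site_z = [
    [4.1, 0.3, -0.5, 3.2, 0.1],
    [3.8, -0.2, 0.4, 2.9, -0.6],
    [4.5, 0.6, -0.1, 3.5, 0.2],
]
p_values = stouffer_p_values(site_z)
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

The two functions are kept separate because any valid per-covariate p-values could be passed to the thresholding step, so the combination rule can be swapped without touching the FDR control.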