Comparing the Performance of Statistical Adjustment Methods By Recovering the Experimental Benchmark from the REFLUX Trial

Title

Comparing the Performance of Statistical Adjustment Methods By Recovering the Experimental Benchmark from the REFLUX Trial

Authors

Luke Keele, Stephen O'Neill, Richard Grieve

Abstract

Much evidence in comparative effectiveness research is based on observational studies. Researchers who conduct observational studies typically assume that there are no unobservable differences between the treated and control groups. Treatment effects are estimated after adjusting for observed differences between the treated and control groups. However, treatment effect estimates may be biased due to model misspecification. That is, if the method of treatment effect estimation imposes unduly strong functional form assumptions, treatment effect estimates may be significantly biased. In this study, we compare the performance of a wide variety of treatment effect estimation methods. We do so within the context of the REFLUX study from the UK. In REFLUX, after study qualification, participants were enrolled in either a randomized trial arm or a patient preference arm. In the randomized trial, patients were randomly assigned to either surgery or medical management. In the patient preference arm, participants chose either surgery or medical management. We attempt to recover the treatment effect estimate from the randomized trial arm using the data from the patient preference arm of the study. We vary the method of treatment effect estimation and record which methods are successful and which are not. We apply over 20 different methods, including standard regression models as well as advanced machine learning methods. We find that simple propensity score matching methods perform the worst. We also find significant variation in performance across methods. The wide variation in performance suggests analysts should use multiple methods of estimation as a robustness check.
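
The benchmark-recovery idea in the abstract can be illustrated with a small simulation. The sketch below is not the authors' analysis and uses no REFLUX data: the data-generating process, the single binary covariate `severe`, the effect size, and all function names are assumptions made for illustration. It uses simple stratification on one observed covariate, which stands in for the adjustment step and is not claimed to be one of the methods the paper evaluates.

```python
import random
from statistics import mean


def simulate_arm(n, randomized, rng, effect=2.0):
    """Simulate one study arm (hypothetical data, not REFLUX).

    Each row is (severe, treated, outcome). In the randomized arm,
    treatment is a coin flip; in the preference arm, patients with
    severe symptoms are assumed to choose surgery more often.
    """
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        if randomized:
            p_treat = 0.5
        else:
            p_treat = 0.8 if severe else 0.2
        treated = rng.random() < p_treat
        outcome = 10.0 - 3.0 * severe + effect * treated + rng.gauss(0, 1)
        rows.append((severe, treated, outcome))
    return rows


def diff_in_means(rows):
    """Unadjusted difference in mean outcome, treated minus control."""
    treated = [y for _, d, y in rows if d]
    control = [y for _, d, y in rows if not d]
    return mean(treated) - mean(control)


def stratified_estimate(rows):
    """Within-stratum differences in means, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        total += len(stratum) / len(rows) * diff_in_means(stratum)
    return total


rng = random.Random(0)
trial = simulate_arm(5000, randomized=True, rng=rng)
preference = simulate_arm(5000, randomized=False, rng=rng)

benchmark = diff_in_means(trial)            # experimental benchmark
naive = diff_in_means(preference)           # no adjustment
adjusted = stratified_estimate(preference)  # adjusts for `severe`

print(f"benchmark={benchmark:.2f} naive={naive:.2f} adjusted={adjusted:.2f}")
```

In this toy setting the adjusted estimate lands close to the benchmark because the only confounder is observed and the adjustment imposes no functional form on it. The abstract's concern is the harder case, where an adjustment method's functional form assumptions are too strong and the benchmark is not recovered.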
