Title

Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model

Authors

Alejandro Schuler, Aashish Bhardwaj, Vincent Liu

Abstract

Clinical researchers often select among and evaluate risk prediction models using standard machine learning metrics based on confusion matrices. However, if these models are used to allocate interventions to patients, standard metrics calculated from retrospective data are only related to model utility (in terms of reductions in outcomes) under certain assumptions. When predictions are delivered repeatedly throughout time (e.g. in a patient encounter), the relationship between standard metrics and utility is further complicated. Several kinds of evaluations have been used in the literature, but it has not been clear what the target of estimation is in each evaluation. We synthesize these approaches, determine what is being estimated in each of them, and discuss under what assumptions those estimates are valid. We demonstrate our insights using simulated data as well as real data used in the design of an early warning system. Our theoretical and empirical results show that evaluations without interventional data either do not estimate meaningful quantities, require strong assumptions, or are limited to estimating best-case scenario bounds.
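The abstract's core claim, that confusion-matrix metrics computed on retrospective data need not track the outcome reduction achieved once the model triggers interventions, can be illustrated with a small simulation. The sketch below is not the paper's experiment: the beta-distributed risks, the 0.35 alerting threshold, and the efficacy values are all illustrative assumptions. It shows that a model with fixed retrospective sensitivity and PPV can correspond to very different reductions in outcomes, depending on an intervention effect that retrospective data alone cannot identify.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Baseline outcome risk per patient, observed retrospectively
# (i.e., with no intervention). Beta(2, 8) is an assumed shape.
risk = rng.beta(2, 8, size=n)
outcome = rng.binomial(1, risk)

# A hypothetical risk model: a noisy version of the true risk,
# with an assumed intervention-triggering threshold of 0.35.
score = risk + rng.normal(0, 0.05, size=n)
flag = score > 0.35

# Standard confusion-matrix metrics on the retrospective data.
tp = np.sum(flag & (outcome == 1))
fp = np.sum(flag & (outcome == 0))
fn = np.sum(~flag & (outcome == 1))
sensitivity = tp / (tp + fn)
ppv = tp / (tp + fp)

# Deployment: assume the intervention prevents the outcome in a
# fraction `efficacy` of flagged patients who would otherwise have
# had it. This treatment effect is not identifiable from the
# retrospective data above, yet it drives the realized benefit.
for efficacy in (0.0, 0.3, 0.8):
    prevented = rng.binomial(1, efficacy, size=n) * flag * outcome
    outcomes_under_model = outcome.sum() - prevented.sum()
    reduction = 1 - outcomes_under_model / outcome.sum()
    print(f"efficacy={efficacy:.1f}: sens={sensitivity:.2f}, "
          f"ppv={ppv:.2f}, outcome reduction={reduction:.2%}")
```

Holding the model, and hence its retrospective sensitivity and PPV, fixed while varying `efficacy` reproduces in miniature why the paper argues that evaluations without interventional data at best bound the achievable benefit.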
