Title
Assessing quality of selection procedures: Lower bound of false positive rate as a function of inter-rater reliability
Authors
Abstract
Inter-rater reliability (IRR) is one of the commonly used tools for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome: the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute error probabilities of the selection procedure (i.e., the false positive and false negative rates) or their lower bounds. We draw connections between inter-rater reliability and binary classification metrics, showing that the binary classification metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss other possible uses of the explored connections in contexts such as educational testing, psychological assessment, and health-related measurement, and we implement the computations in the IRR2FPR R package.
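The computations are implemented in the authors' IRR2FPR R package. As a rough illustration of the underlying idea only (not the package's API; the function and variable names below are our own, and the paper's exact definitions of the error rates may differ), the following sketch uses the mvtnorm package and assumes a classical measurement model in which observed and true scores are standard bivariate normal with correlation sqrt(IRR), with the top-p fraction of applicants selected by observed score:

```r
# A minimal sketch, assuming a bivariate normal model for observed and true
# scores with correlation sqrt(IRR), and top-p selection by observed score.
library(mvtnorm)

selection_metrics <- function(IRR, p) {
  z <- qnorm(1 - p)                  # selection cutoff on both score scales
  r <- sqrt(IRR)                     # correlation between observed and true score
  # P(observed > z AND true > z) under the bivariate normal model
  joint <- pmvnorm(lower = c(z, z), upper = c(Inf, Inf),
                   sigma = matrix(c(1, r, r, 1), nrow = 2))[1]
  tpr <- joint / p                   # selected applicants who are truly in the top p
  c(true_positive_rate  = tpr,
    false_positive_rate = 1 - tpr)   # selected applicants who should not have been
}

# Example: IRR of 0.6, selecting the top 20% of applicants
selection_metrics(IRR = 0.6, p = 0.2)
```

As the sketch suggests, both metrics are functions of the IRR coefficient and the selected proportion alone, which is the connection the abstract describes.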