Paper Title

Joint Upper & Lower Bound Normalization for IR Evaluation

Authors

Shubhra Kanti Karmaker Santu, Dongji Feng

Abstract

In this paper, we present a novel perspective on IR evaluation by proposing a new family of evaluation metrics in which existing popular metrics (e.g., nDCG, MAP) are customized by introducing a query-specific lower-bound (LB) normalization term. While the original nDCG, MAP, etc. are normalized by their upper bounds based on an ideal ranked list, a corresponding LB normalization for them has not yet been studied. Specifically, we introduce two different variants of the proposed LB normalization, where the lower bound is estimated from a randomized ranking of the corresponding documents present in the evaluation set. We then conduct two case studies by instantiating the new framework for two popular IR evaluation metrics (with two variants each, e.g., DCG_UL_V1,2 and MSP_UL_V1,2) and comparing them against the traditional metrics without the proposed LB normalization. Experiments on two different datasets with eight Learning-to-Rank (LETOR) methods demonstrate the following properties of the new LB-normalized metrics: 1) Statistically significant differences (between two methods) in terms of the original metric no longer remain statistically significant in terms of the Upper-Lower (UL) bound normalized version, and vice versa, especially for uninformative query sets. 2) When compared against the original metrics, our proposed UL-normalized metrics demonstrate higher Discriminatory Power and better Consistency across different datasets. These findings suggest that the IR community should consider UL normalization seriously when computing nDCG and MAP, and that a more in-depth study of UL normalization for general IR evaluation is warranted.
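To make the idea of joint upper and lower bound normalization concrete, the following is a minimal sketch for DCG. The abstract does not state the exact formulas for the two variants, so everything specific here is an assumption for illustration: the function names are hypothetical, the gain is taken to be the raw relevance label, the lower bound is taken to be the expected DCG under a uniformly random ranking of the query's documents, and the normalized score is assumed to have the form (DCG - LB) / (IDCG - LB). It should not be read as either of the paper's variants.

```python
import math


def dcg(gains):
    """DCG of a ranked list of relevance gains (assumed linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def expected_random_dcg(gains):
    """Expected DCG over uniformly random permutations of the same documents.

    Every document is equally likely to land at every rank, so each rank
    contributes the mean gain times that rank's discount.
    """
    n = len(gains)
    if n == 0:
        return 0.0
    mean_gain = sum(gains) / n
    return mean_gain * sum(1.0 / math.log2(rank + 2) for rank in range(n))


def ul_normalized_dcg(ranked_gains):
    """Assumed UL form: (DCG - LB) / (IDCG - LB); not the paper's exact definition."""
    upper = dcg(sorted(ranked_gains, reverse=True))  # ideal ranking (upper bound)
    lower = expected_random_dcg(ranked_gains)        # random ranking (lower bound)
    if math.isclose(upper, lower, abs_tol=1e-12):
        # All documents share the same label: no ranking can beat random,
        # so the query is uninformative. Returning 0.0 is a choice made here.
        return 0.0
    return (dcg(ranked_gains) - lower) / (upper - lower)
```

Under these assumptions an ideal ranking scores 1, a ranking that matches the random expectation scores 0, and a ranking worse than random scores below 0. Plain nDCG cannot show the last two cases, because it is bounded below by 0 whatever the query's label distribution.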
