Joint Upper & Lower Bound Normalization for IR Evaluation

Paper Title

Joint Upper & Lower Bound Normalization for IR Evaluation

Authors

Santu, Shubhra Kanti Karmaker; Feng, Dongji

Abstract

In this paper, we present a novel perspective on IR evaluation by proposing a new family of evaluation metrics in which existing popular metrics (e.g., nDCG, MAP) are customized by introducing a query-specific lower-bound (LB) normalization term. While the original nDCG, MAP, etc. are normalized in terms of their upper bounds based on an ideal ranked list, a corresponding LB normalization for them has not yet been studied. Specifically, we introduce two different variants of the proposed LB normalization, where the lower bound is estimated from a randomized ranking of the corresponding documents present in the evaluation set. We then conduct two case studies by instantiating the new framework for two popular IR evaluation metrics (with two variants each, e.g., DCG_UL_V1,2 and MSP_UL_V1,2) and comparing them against the traditional metrics without the proposed LB normalization. Experiments on two different datasets with eight Learning-to-Rank (LETOR) methods demonstrate the following properties of the new LB-normalized metrics: 1) Statistically significant differences (between two methods) in terms of the original metric no longer remain statistically significant in terms of the Upper-Lower (UL) bound normalized version and vice versa, especially for uninformative query sets. 2) When compared against the original metrics, our proposed UL-normalized metrics demonstrate higher Discriminatory Power and better Consistency across different datasets. These findings suggest that the IR community should consider UL normalization seriously when computing nDCG and MAP, and that a more in-depth study of UL normalization for general IR evaluation is warranted.
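
The abstract does not give the formulas for the two LB variants, so the following is only a rough sketch of the general idea, not the paper's DCG_UL_V1 or DCG_UL_V2 definition. It assumes a min-max style normalization, (DCG - LB) / (IDCG - LB), with the lower bound taken as the expected DCG of a uniformly random ranking of the same documents. The function names, the exponential gain, and the log2 discount are all illustrative assumptions.

```python
import math


def dcg(relevances):
    """DCG of a ranked list of graded relevance labels (rank 0 is the top)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def expected_random_dcg(relevances):
    """Expected DCG under a uniformly random permutation of the same documents.

    Every document is equally likely to land at every rank, so the expectation
    equals (mean gain) * (sum of rank discounts).
    """
    n = len(relevances)
    if n == 0:
        return 0.0
    mean_gain = sum(2 ** rel - 1 for rel in relevances) / n
    return mean_gain * sum(1 / math.log2(rank + 2) for rank in range(n))


def ul_normalized_dcg(relevances):
    """Hypothetical upper/lower-bound normalized DCG (assumed min-max form)."""
    upper = dcg(sorted(relevances, reverse=True))  # ideal ranking (upper bound)
    lower = expected_random_dcg(relevances)        # random-ranking lower bound
    if math.isclose(upper, lower):
        # Uninformative query: every ranking scores the same, so no ranker
        # can be distinguished from random ordering on it.
        return 0.0
    return (dcg(relevances) - lower) / (upper - lower)
```

Under this assumed form, an ideal ranking scores 1, a ranking no better than random scores about 0, and a ranking worse than random scores below 0. A query whose documents all share one relevance label contributes 0 rather than the perfect score that upper-bound-only normalization would assign, which is one way to read the abstract's remark about uninformative query sets.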
