Paper Title

KPQA: A Metric for Generative Question Answering Using Keyphrase Weights

Paper Authors

Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Joongbo Shin, Kyomin Jung

Paper Abstract

In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to their free-form nature. In particular, the widely used n-gram similarity metrics often fail to discriminate incorrect answers because they consider all tokens equally. To alleviate this problem, we propose KPQA-metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns a different weight to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using our human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics. The code is available at https://github.com/hwanheelee1993/KPQA.
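
To make the weighting idea concrete, below is a minimal sketch in Python of a keyphrase-weighted unigram F1. It is not the official implementation (see the repository above): the weights dictionary is hard-coded here for illustration, whereas KPQA derives per-token weights from a trained keyphrase-prediction model and applies them on top of existing similarity metrics.

    # Minimal sketch of keyphrase-weighted unigram F1 (illustrative only;
    # the official KPQA code is at https://github.com/hwanheelee1993/KPQA).

    def weighted_f1(generated, reference, weights, default=0.1):
        """Unigram F1 in which each token contributes its keyphrase weight
        instead of counting equally."""
        gen_tokens = generated.lower().split()
        ref_tokens = reference.lower().split()
        gen_set, ref_set = set(gen_tokens), set(ref_tokens)

        def w(token):
            return weights.get(token, default)

        # Precision: share of the generated answer's weight mass that
        # also appears in the reference.
        prec_den = sum(w(t) for t in gen_tokens)
        prec_num = sum(w(t) for t in gen_tokens if t in ref_set)

        # Recall: share of the reference's weight mass recovered by the
        # generated answer.
        rec_den = sum(w(t) for t in ref_tokens)
        rec_num = sum(w(t) for t in ref_tokens if t in gen_set)

        precision = prec_num / prec_den if prec_den else 0.0
        recall = rec_num / rec_den if rec_den else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    if __name__ == "__main__":
        reference = "the movie was released in 1994"
        # Hypothetical keyphrase weights: the answer-bearing token "1994"
        # dominates, content words get moderate weight, and everything else
        # falls back to the small default. KPQA would instead predict these
        # weights from the question and reference answer.
        weights = {"1994": 1.0, "released": 0.5, "movie": 0.3}

        print(weighted_f1("it was released in 1994", reference, weights))        # ~0.87
        print(weighted_f1("the movie was released in 2005", reference, weights)) # ~0.67

Note that with uniform weights, this unigram F1 ranks the incorrect answer above the correct one (~0.83 vs. ~0.73) because it shares more surface tokens with the reference; weighting the answer-bearing token "1994" heavily reverses that ordering, which is exactly the failure mode of n-gram metrics the abstract describes.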
