Paper Title

SummEval: Re-evaluating Summarization Evaluation

Paper Authors

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev

Paper Abstract

The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations, 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics, 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format, 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics, 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments of model-generated summaries on the CNN/Daily Mail dataset annotated by both expert judges and crowd-source workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.
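The abstract does not show the toolkit's actual interface, but point 4 describes an extensible, unified API under which many automatic metrics can be run interchangeably. The sketch below is a hypothetical illustration of that design pattern, not the SummEval toolkit's real API: all names (`Metric`, `Rouge1Metric`, `evaluate_all`) are invented, and the ROUGE-1 computation is a toy unigram-overlap version rather than the official scorer.

```python
from collections import Counter

class Metric:
    """Hypothetical base class: every metric exposes the same evaluate() call,
    so new metrics plug in without changing the evaluation driver."""
    def evaluate(self, summary: str, reference: str) -> dict:
        raise NotImplementedError

class Rouge1Metric(Metric):
    """Toy unigram-overlap ROUGE-1 (illustration only, not the official scorer)."""
    def evaluate(self, summary: str, reference: str) -> dict:
        sys_tokens = Counter(summary.lower().split())
        ref_tokens = Counter(reference.lower().split())
        # Clipped unigram overlap between system and reference summaries.
        overlap = sum((sys_tokens & ref_tokens).values())
        precision = overlap / max(sum(sys_tokens.values()), 1)
        recall = overlap / max(sum(ref_tokens.values()), 1)
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
        return {"rouge1_precision": precision, "rouge1_recall": recall, "rouge1_f1": f1}

def evaluate_all(metrics, summary, reference):
    """Run every registered metric through the shared interface and merge scores."""
    scores = {}
    for metric in metrics:
        scores.update(metric.evaluate(summary, reference))
    return scores

scores = evaluate_all([Rouge1Metric()],
                      "the cat sat on the mat",
                      "the cat lay on the mat")
print(round(scores["rouge1_f1"], 4))  # → 0.8333 (5 of 6 unigrams overlap)
```

A real toolkit in this style would register model-based metrics (e.g. BERTScore-like scorers) behind the same `evaluate` signature, which is what makes benchmarking many metrics over many model outputs consistent.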
