SciREX: A Challenge Dataset for Document-Level Information Extraction

Paper title

SciREX: A Challenge Dataset for Document-Level Information Extraction

Authors

Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, Iz Beltagy

Abstract

Extracting information from full documents is an important problem in many domains, but most previous work focuses on identifying relationships within a sentence or a paragraph. It is challenging to create a large-scale information extraction (IE) dataset at the document level, since it requires an understanding of the whole document to annotate entities and their document-level relationships, which usually span beyond sentences or even sections. In this paper, we introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks, including salient entity identification and document-level $N$-ary relation identification from scientific articles. We annotate our dataset by integrating automatic and human annotations, leveraging existing scientific knowledge resources. We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE. Analyzing the model performance shows a significant gap between human performance and current baselines, inviting the community to use our dataset as a challenge to develop document-level IE models. Our data and code are publicly available at https://github.com/allenai/SciREX.
