LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents

Paper Title

LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents

Authors

Debanjan Mahata, Navneet Agarwal, Dibya Gautam, Amardeep Kumar, Swapnil Parekh, Yaman Kumar Singla, Anish Acharya, Rajiv Ratn Shah

Abstract

Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of benchmark datasets for this task come from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (approximately 8 sentences). This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are found directly beyond the limited context of the title and abstract. Therefore, we release two extensive corpora mapping KPs of ~1.3M and ~100K scientific articles with their fully extracted text and additional metadata, including publication venue, year, author, field of study, and citations, to facilitate research on this real-world problem.
