Aspect-based Document Similarity for Research Papers

Paper Title

Aspect-based Document Similarity for Research Papers

Authors

Malte Ostendorff, Terry Ruas, Till Blume, Bela Gipp, Georg Rehm

Abstract

Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity for research papers. Paper citations indicate the aspect-based similarity, i.e., the section title in which a citation occurs acts as a label for the pair of citing and cited paper. We apply a series of Transformer models such as RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. Our results show SciBERT as the best performing system. A qualitative examination validates our quantitative results. Our findings motivate future research of aspect-based document similarity and the development of a recommender system based on the evaluated techniques. We make our datasets, code, and trained models publicly available.
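
The labeling idea in the abstract, where the title of the section containing a citation becomes the label for the pair of citing and cited paper, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the authors' released code: the function name `build_pair_labels`, the `(citing_id, cited_id, section_title)` record layout, the lowercase normalization of section titles, and the choice to keep several labels when one paper cites another in more than one section.

```python
from collections import defaultdict


def build_pair_labels(citations):
    """Group citation records into labeled (citing, cited) paper pairs.

    `citations` is an iterable of (citing_id, cited_id, section_title)
    tuples. This record layout is hypothetical, chosen for illustration.
    """
    labels = defaultdict(set)
    for citing_id, cited_id, section_title in citations:
        # Assumption: titles are normalized by trimming and lowercasing,
        # and a pair keeps every section title it was cited under.
        labels[(citing_id, cited_id)].add(section_title.strip().lower())
    # Sort the labels so the output is deterministic.
    return {pair: sorted(titles) for pair, titles in labels.items()}


# Toy records; the paper identifiers are made up.
citations = [
    ("paper_a", "paper_b", "Introduction"),
    ("paper_a", "paper_b", "Related Work"),
    ("paper_a", "paper_c", "Experiments"),
]
pairs = build_pair_labels(citations)
```

Each resulting pair and its label list could then serve as one training example for a pairwise document classifier, such as the Transformer models or the LSTM baseline named in the abstract.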
