Pairwise Learning for Name Disambiguation in Large-Scale Heterogeneous Academic Networks

Paper title

Pairwise Learning for Name Disambiguation in Large-Scale Heterogeneous Academic Networks

Authors

Qingyun Sun, Hao Peng, Jianxin Li, Senzhang Wang, Xiangyu Dong, Liangxuan Zhao, Philip S. Yu, Lifang He

Abstract

Name disambiguation aims to identify unique authors with the same name. Existing name disambiguation methods always exploit author attributes to enhance disambiguation results. However, some discriminative author attributes (e.g., email and affiliation) may change because of graduation or job-hopping, which results in the same author's papers being separated in digital libraries. Although these attributes may change, an author's co-authors and research topics do not change frequently with time, which means that papers within a period have similar text and relation information in the academic network. Inspired by this idea, we introduce the Multi-view Attention-based Pairwise Recurrent Neural Network (MA-PairRNN) to solve the name disambiguation problem. We divide papers into small blocks based on discriminative author attributes, and blocks of the same author are merged according to the pairwise classification results of MA-PairRNN. MA-PairRNN combines heterogeneous graph embedding learning and pairwise similarity learning into one framework. In addition to attribute and structure information, MA-PairRNN also exploits semantic information via meta-paths and generates node representations in an inductive way, which is scalable to large graphs. Furthermore, a semantic-level attention mechanism is adopted to fuse multiple meta-path based representations. A Pseudo-Siamese network consisting of two RNNs takes two paper sequences in publication time order as input and outputs their similarity. Results on two real-world datasets demonstrate that our framework achieves a significant and consistent performance improvement on the name disambiguation task. It is also demonstrated that MA-PairRNN can perform well with a small amount of training data and has better generalization ability across different research areas.
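
The block-then-merge step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the function `merge_blocks`, the paper fields `id`, `year`, and `coauthors`, and the stand-in classifier `toy_same_author`, which uses co-author overlap with a threshold of 0.5 in place of the trained MA-PairRNN. The abstract also does not say how pairwise decisions are combined, so the sketch assumes they are merged transitively with a union-find structure.

```python
from itertools import combinations
from typing import Callable, Dict, List

Paper = Dict[str, object]   # hypothetical record: {"id", "year", "coauthors"}
Block = List[Paper]         # papers already grouped by a discriminative attribute


def merge_blocks(blocks: List[Block],
                 same_author: Callable[[Block, Block], bool]) -> List[Block]:
    """Merge blocks whose pairwise classifier says 'same author'.

    Assumption: decisions are applied transitively (union-find).
    """
    parent = list(range(len(blocks)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(blocks)), 2):
        if same_author(blocks[i], blocks[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    merged: Dict[int, Block] = {}
    for i, block in enumerate(blocks):
        merged.setdefault(find(i), []).extend(block)
    # Keep each merged cluster in publication-time order.
    return [sorted(b, key=lambda p: p["year"]) for b in merged.values()]


def toy_same_author(a: Block, b: Block, threshold: float = 0.5) -> bool:
    """Stand-in for the learned classifier: Jaccard overlap of co-author sets."""
    ca = set().union(*(p["coauthors"] for p in a))
    cb = set().union(*(p["coauthors"] for p in b))
    union = ca | cb
    return bool(union) and len(ca & cb) / len(union) >= threshold


blocks = [
    [{"id": "p1", "year": 2015, "coauthors": {"A", "B"}},
     {"id": "p2", "year": 2016, "coauthors": {"A", "C"}}],
    [{"id": "p3", "year": 2018, "coauthors": {"A", "B", "C"}}],
    [{"id": "p4", "year": 2017, "coauthors": {"X", "Y"}}],
]
clusters = merge_blocks(blocks, toy_same_author)
cluster_ids = [[p["id"] for p in c] for c in clusters]
print(cluster_ids)
```

In this toy input the first two blocks share the same co-author set and are merged, while the third block stays separate. In the framework the abstract describes, the decision function would instead be the similarity output of the Pseudo-Siamese RNN pair applied to the two time-ordered paper sequences.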
