Paper Title
IR-BERT: Leveraging BERT for Semantic Search in Background Linking for News Articles
Paper Authors
Abstract
This work describes our two approaches for the background linking task of the TREC 2020 News Track. The main objective of this task is to recommend a list of relevant articles that the reader should refer to in order to understand the context and gain background information about the query article. Our first approach focuses on building an effective search query by combining weighted keywords extracted from the query document, and uses BM25 for retrieval. The second approach leverages the capability of SBERT (Nils Reimers et al.) to learn contextual representations of the query in order to perform semantic search over the corpus. We empirically show that employing a language model benefits our approach in understanding the context as well as the background of the query article. The proposed approaches are evaluated on the TREC 2018 Washington Post dataset, and our best model outperforms both the TREC median and the highest-scoring model of 2018 in terms of the nDCG@5 metric. We further propose a diversity measure to evaluate the effectiveness of the various approaches in retrieving a diverse set of documents. This would potentially motivate researchers to work on introducing diversity in their recommended lists. We have open-sourced our implementation on GitHub and plan to submit our runs for the background linking task in TREC 2020.
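The first approach scores corpus articles against a weighted-keyword query with BM25. As an illustration only (not the authors' implementation), the sketch below implements the standard Okapi BM25 formula in pure Python over pre-tokenized documents; the parameter values `k1=1.2` and `b=0.75` are common defaults, not values taken from the paper.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`
    using the Okapi BM25 ranking function. Returns one score per doc."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency of each distinct query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the corpus contributes nothing
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy example: the article sharing both query terms ranks first.
docs = [["news", "article", "trec"], ["sports", "news"], ["cooking"]]
print(bm25_scores(["news", "trec"], docs))
```

In the paper's setting, `query_terms` would be the weighted keywords extracted from the query article and `docs` the tokenized Washington Post corpus; term weights could be folded in by scaling each term's contribution.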