Title

English dictionaries, gold and silver standard corpora for biomedical natural language processing related to SARS-CoV-2 and COVID-19

Authors

Rashed, Salma Kazemi; Ahmed, Rafsan; Frid, Johan; Aits, Sonja

Abstract

Automated information extraction with natural language processing (NLP) tools is required to gain systematic insights from the large number of COVID-19 publications, reports and social media posts, which far exceed human processing capabilities. A key challenge for NLP is the extensive variation in terminology used to describe medical entities, which was especially pronounced for this newly emergent disease. Here we present an NLP toolbox comprising very large English dictionaries of synonyms for SARS-CoV-2 (including variant names) and COVID-19, which can be used with dictionary-based NLP tools. We also present a silver standard corpus generated with the dictionaries, and a gold standard corpus, consisting of PubMed abstracts manually annotated for disease, virus, symptom, protein/gene, cell type, chemical and species terms, which can be used to train and evaluate COVID-19-related NLP tools. Code for annotation, which can be used to expand the silver standard corpus or for text mining, is also included. This toolbox is freely available on GitHub (https://github.com/Aitslab/corona) and Zenodo (https://doi.org/10.5281/zenodo.6642275). The toolbox can be used for a variety of text analytics tasks related to the COVID-19 crisis and has already been used to create a COVID-19 knowledge graph, study the variability and evolution of COVID-19-related terminology and develop and benchmark text mining tools.
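
To illustrate how synonym dictionaries can drive dictionary-based tagging of the kind the abstract mentions, here is a minimal Python sketch. The term lists, entity labels, and function names are hypothetical and chosen only for illustration; they are not taken from the toolbox, whose actual dictionaries and annotation code are in the repository linked above.

```python
import re

# Hypothetical mini-dictionaries for illustration only; the toolbox
# dictionaries described in the abstract are far larger.
DICTIONARIES = {
    "virus": [
        "SARS-CoV-2",
        "2019-nCoV",
        "severe acute respiratory syndrome coronavirus 2",
    ],
    "disease": ["COVID-19", "coronavirus disease 2019"],
}


def build_pattern(terms):
    """Compile one case-insensitive pattern matching any term in the list."""
    # Try longer terms first so a long synonym wins over a shorter one
    # that shares its prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds reject matches embedded in a longer token
    # (for example, "COVID-19" inside "COVID-190").
    return re.compile(
        r"(?<![\w-])(?:" + alternation + r")(?![\w-])", re.IGNORECASE
    )


PATTERNS = {label: build_pattern(terms) for label, terms in DICTIONARIES.items()}


def annotate(text):
    """Return sorted (start, end, matched_text, label) tuples for all hits."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), match.group(0), label))
    return sorted(spans)


text = "SARS-CoV-2 causes COVID-19, also called coronavirus disease 2019."
annotations = annotate(text)
for start, end, surface, label in annotations:
    print(start, end, surface, label)
```

Applying such a matcher to a large collection of abstracts is one simple way automatically labelled text of the silver-standard kind could be produced, although the procedure actually used for the corpus may differ.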
