Paper Title

Improved Biomedical Word Embeddings in the Transformer Era

Paper Authors

Jiho Noh, Ramakanth Kavuluru

Paper Abstract

Biomedical word embeddings are usually pre-trained on free text corpora with neural methods that capture local and global distributional properties. They are leveraged in downstream tasks using various neural architectures that are designed to optimize task-specific objectives that might further tune such embeddings. Since 2018, however, there has been a marked shift from these static embeddings to contextual embeddings motivated by language models (e.g., ELMo, transformers such as BERT, and ULMFiT). These dynamic embeddings have the added benefit of being able to distinguish homonyms and acronyms given their context. However, static embeddings are still relevant in low-resource settings (e.g., smart devices, IoT elements) and for studying lexical semantics from a computational linguistics perspective. In this paper, we jointly learn word and concept embeddings, first using the skip-gram method and then fine-tuning them with correlational information manifesting in co-occurring Medical Subject Heading (MeSH) concepts in biomedical citations. This fine-tuning is accomplished with the BERT transformer architecture in the two-sentence input mode, with a classification objective that captures MeSH pair co-occurrence. In essence, we repurpose a transformer architecture (typically used to generate dynamic embeddings) to improve static embeddings using concept correlations. We evaluate these tuned static embeddings using multiple word-relatedness datasets developed by previous efforts. Without selectively culling concepts and terms (as was pursued by previous efforts), we believe we offer the most exhaustive evaluation of static embeddings to date, with clear performance improvements across the board. We provide our code and embeddings for public use, for downstream applications and research endeavors: https://github.com/bionlproc/BERT-CRel-Embeddings
