Paper Title
Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks
Paper Authors
Paper Abstract
In this paper we study the problem of predicting clinical diagnoses from textual Electronic Health Records (EHR) data. We show the importance of this problem in the medical community and present a comprehensive historical review of the problem and proposed methods. As the main scientific contributions, we present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification that implements a novel way of Fully-Connected (FC) layer composition, and a BERT model pretrained only on domain data. To empirically validate our model, we use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits. This is the largest such study for the Russian language and one of the largest globally. We performed a number of comparative experiments with other text representation models on the task of multiclass classification over a 265-disease subset of ICD-10. The experiments demonstrate improved performance of our models compared to other baselines, including a fine-tuned Russian BERT (RuBERT) variant. We also show that our model performs comparably to a panel of experienced medical experts. This gives us hope that deployment of such a system will reduce misdiagnosis.
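To make the described setup concrete, below is a minimal sketch of a BERT-based multiclass classifier for ICD-10 codes with a fully-connected head. It is not the authors' architecture: the abstract only states that a novel FC layer composition is used, so the encoder checkpoint (`DeepPavlov/rubert-base-cased`), the head sizes, dropout, and the use of the [CLS] vector are illustrative assumptions; only the 265-class output follows from the abstract.

```python
# Hedged sketch: BERT encoder + FC head for 265-way ICD-10 classification.
# Checkpoint name, head sizes, and pooling choice are assumptions, not the
# paper's exact model.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertDiagnosisClassifier(nn.Module):
    def __init__(self, encoder_name: str = "DeepPavlov/rubert-base-cased",
                 num_classes: int = 265, hidden_size: int = 256,
                 dropout: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        enc_dim = self.encoder.config.hidden_size
        # FC head on top of the [CLS] representation; the paper's specific
        # FC composition is only hinted at in the abstract.
        self.head = nn.Sequential(
            nn.Linear(enc_dim, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_classes),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vec = out.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.head(cls_vec)              # raw logits over ICD-10 codes

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
    model = BertDiagnosisClassifier()
    # Toy EHR snippet ("patient complains of headache"); real inputs would be
    # full visit records from the EHR dataset.
    batch = tokenizer(["пациент жалуется на головную боль"],
                      return_tensors="pt", truncation=True, padding=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    predicted_class = logits.argmax(dim=-1)  # index of the predicted ICD-10 code
```

Training such a classifier would typically minimize a cross-entropy loss over the 265 classes; the abstract additionally notes that the authors' best variant relies on a BERT encoder pretrained only on domain (EHR) data rather than a general-purpose checkpoint.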