Paper Title
A Qualitative Evaluation of Language Models on Automatic Question-Answering for COVID-19
Paper Authors
Paper Abstract
COVID-19 has resulted in an ongoing pandemic and, as of 12 June 2020, has caused more than 7.4 million cases and over 418,000 deaths. The highly dynamic and rapidly evolving situation with COVID-19 has made it difficult to access accurate, on-demand information regarding the disease. Online communities, forums, and social media provide potential venues to search for relevant questions and answers, or to post questions and seek answers from other members. However, due to the nature of such sites, there is always only a limited number of relevant questions and responses to search from, and posted questions are rarely answered immediately. With the advancements in the field of natural language processing, particularly in the domain of language models, it has become possible to design chatbots that can automatically answer consumer questions. However, such models are rarely applied and evaluated in the healthcare domain to meet information needs with accurate and up-to-date healthcare data. In this paper, we propose to apply a language model for automatically answering questions related to COVID-19 and to qualitatively evaluate the generated responses. We utilized the GPT-2 language model and applied transfer learning to retrain it on the COVID-19 Open Research Dataset (CORD-19) corpus. In order to improve the quality of the generated responses, we applied four different approaches, namely tf-idf, BERT, BioBERT, and USE, to filter and retain relevant sentences in the responses. In the performance evaluation step, we asked two medical experts to rate the responses. We found that BERT and BioBERT, on average, outperform both tf-idf and USE in relevance-based sentence filtering tasks. Additionally, based on the chatbot, we created a user-friendly interactive web application to be hosted online.
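To illustrate the relevance-based sentence filtering step described above, the following is a minimal sketch of the tf-idf variant (the BERT, BioBERT, and USE variants would instead compare sentence embeddings). This is not the authors' released code; the function name, the similarity threshold, and the example texts are assumptions made for illustration only.

```python
# Hypothetical sketch of tf-idf relevance filtering for generated responses.
# Assumes scikit-learn is installed; threshold and names are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def filter_relevant_sentences(question, response_sentences, threshold=0.2):
    """Keep response sentences whose tf-idf cosine similarity to the
    question exceeds an assumed threshold."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on the question plus all candidate sentences so they share a vocabulary.
    vectors = vectorizer.fit_transform([question] + response_sentences)
    question_vec, sentence_vecs = vectors[0], vectors[1:]
    scores = cosine_similarity(question_vec, sentence_vecs).ravel()
    return [s for s, score in zip(response_sentences, scores) if score >= threshold]


if __name__ == "__main__":
    question = "How does COVID-19 spread between people?"
    generated = [
        "COVID-19 spreads mainly through respiratory droplets.",
        "The dataset contains thousands of scholarly articles.",
    ]
    print(filter_relevant_sentences(question, generated))
```

In this sketch, sentences from a GPT-2 generated response are ranked against the user's question and only those above the threshold are retained, which mirrors the filtering role that tf-idf, BERT, BioBERT, and USE play in the described pipeline.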