Paper title
Identifying epidemic related Tweets using noisy learning
Paper authors
Paper abstract
Supervised learning algorithms rely heavily on annotated datasets to train machine learning models. However, curating annotated datasets is laborious and time-consuming because of the manual effort involved, and it has become a major bottleneck in supervised learning. In this work, we apply the theory of noisy learning to generate weak supervision signals in place of manual annotation. We curate a noisily labeled dataset using a labeling heuristic that identifies epidemic-related tweets. We evaluate performance on a large epidemic corpus, and our results demonstrate that models trained with noisy data in a class-imbalanced, multi-class weak supervision setting achieve performance greater than 90%.
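As a rough illustration of what a labeling heuristic for weak supervision can look like, the sketch below assigns noisy labels to tweets by keyword matching. The abstract does not describe the actual heuristic, so the keyword lists, the class names, the `noisy_label` function, and the choice to abstain on ambiguous tweets are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical keyword lists, one per epidemic class (not taken from the paper).
KEYWORDS = {
    "ebola": ["ebola"],
    "zika": ["zika"],
    "cholera": ["cholera"],
}
UNRELATED = "unrelated"


def noisy_label(tweet: str) -> Optional[str]:
    """Return a weak (noisy) label for a tweet, or None to abstain."""
    # Lowercase and split into alphanumeric tokens so "#Ebola" matches "ebola".
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    matches = [label for label, words in KEYWORDS.items()
               if any(word in tokens for word in words)]
    if not matches:
        return UNRELATED      # no epidemic keyword found
    if len(matches) > 1:
        return None           # ambiguous: abstain rather than guess (assumption)
    return matches[0]


tweets = [
    "New Ebola cases reported in the region",
    "Zika virus warning for travellers",
    "Great football match last night",
    "Ebola and cholera outbreaks strain local clinics",
]
labels = [noisy_label(t) for t in tweets]
```

Labels produced this way are noisy by construction: a tweet that mentions a disease name in passing receives the same label as a genuine outbreak report. A classifier trained on them is therefore still evaluated against a trusted held-out set.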