Partially-Typed NER Datasets Integration: Connecting Practice to Theory

Paper Title

Partially-Typed NER Datasets Integration: Connecting Practice to Theory

Authors

Shi Zhi, Liyuan Liu, Yu Zhang, Shiyin Wang, Qi Li, Chao Zhang, Jiawei Han

Abstract

While typical named entity recognition (NER) models require the training set to be annotated with all target types, each available dataset may cover only a subset of them. Instead of relying on fully-typed NER datasets, many efforts have been made to leverage multiple partially-typed ones for training, allowing the resulting model to cover the full type set. However, there is neither a guarantee on the quality of the integrated datasets nor guidance on the design of training algorithms. Here, we conduct a systematic analysis and comparison of partially-typed and fully-typed NER datasets, both theoretically and empirically. First, we derive a bound establishing that models trained with partially-typed annotations can reach performance similar to that of models trained with fully-typed annotations; the bound also provides guidance on algorithm design. Moreover, we conduct controlled experiments, which show that partially-typed datasets lead to performance similar to that of a model trained with the same amount of fully-typed annotations.
