Paper Title
Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts
Authors
Abstract
Texts obtained from the web are noisy and do not necessarily follow the orthographic sentence and word boundary rules. Thus, sentence segmentation and word tokenization systems that have been developed on well-formed texts might not perform as well on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries of an Estonian web dataset and then present the evaluation results of three existing sentence segmentation and word tokenization systems on this corpus: EstNLTK, Stanza, and UDPipe. While EstNLTK obtains the highest sentence segmentation performance of the three systems on this dataset, the sentence segmentation performance of Stanza and UDPipe remains well below the results obtained on the more well-formed Estonian UD test set.