Paper Title
A Second Wave of UD Hebrew Treebanking and Cross-Domain Parsing
Authors
Abstract
Foundational Hebrew NLP tasks such as segmentation, tagging, and parsing have relied to date on various versions of the Hebrew Treebank (HTB, Sima'an et al. 2001). However, the data in HTB, a single-source newswire corpus, is now over 30 years old and does not cover many aspects of contemporary Hebrew on the web. This paper presents a new, freely available UD treebank of Hebrew, stratified across a range of topics selected from Hebrew Wikipedia. In addition to introducing the corpus and evaluating the quality of its annotations, we deploy automatic validation tools based on Grew (Guillaume, 2021) and conduct the first cross-domain parsing experiments in Hebrew. We obtain new state-of-the-art (SOTA) results on UD NLP tasks, using a combination of the latest language modelling and some incremental improvements to existing transformer-based approaches. We also release a new version of the UD HTB that matches the annotation scheme updates from our new corpus.