Paper Title
Czech Grammar Error Correction with a Large and Diverse Corpus
Paper Authors
Paper Abstract
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC), with the aim of contributing to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) covers four domains, with error distributions ranging from high-error-density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research. Finally, we meta-evaluate common GEC metrics against human judgements on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639 .