Paper Title
Morphology Matters: A Multilingual Language Modeling Analysis
Paper Authors
Abstract
Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.
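The surprisal the abstract refers to is the negative log-probability a model assigns to text. The abstract does not say how it is normalized, so the following is only a generic sketch: it sums surprisal over a whole text rather than averaging per token, which is one way to keep totals comparable when different segmentations (BPE, Morfessor, FST) split the same text into different numbers of tokens. The function name and the probabilities are hypothetical and are not taken from the paper.

```python
import math


def total_surprisal_bits(token_probs):
    """Return the summed surprisal, in bits, of one text.

    token_probs holds the probability the model assigned to each
    token of the text (hypothetical values, for illustration only).
    """
    return sum(-math.log2(p) for p in token_probs)


# The same text under two made-up segmentations of different lengths.
coarse = [0.25, 0.5]           # 2 tokens
fine = [0.5, 0.5, 0.5, 1.0]    # 4 tokens

print(total_surprisal_bits(coarse))  # 3.0
print(total_surprisal_bits(fine))    # 3.0
```

In this toy case both segmentations give the same total even though their per-token averages differ (1.5 bits versus 0.75 bits), which is why a per-token average alone can mislead when segmentations are compared.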