Morphology Matters: A Multilingual Language Modeling Analysis

Paper title

Morphology Matters: A Multilingual Language Modeling Analysis

Authors

Park, Hyunji Hayley; Zhang, Katherine J.; Haley, Coleman; Steimel, Kenneth; Liu, Han; Schwartz, Lane

Abstract

Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling.
