Paper Title
MICE: Mining Idioms with Contextual Embeddings
Paper Authors
Paper Abstract
Idiomatic expressions can be problematic for natural language processing applications, as their meaning cannot be inferred from their constituent words. A lack of successful methodological approaches and sufficiently large datasets prevents the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach, called MICE, that uses contextual embeddings for that purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train classifiers based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform much better than existing approaches and are capable of detecting idiomatic word use, even for expressions that were not present in the training set. We demonstrate cross-lingual transfer of the developed models and analyze the size of the required dataset.
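The abstract only summarizes the approach at a high level. As a rough illustration of the general recipe it describes (contextual embeddings of a target multi-word expression pooled and fed into a small neural classifier that predicts idiomatic vs. literal use), consider the hedged sketch below. The multilingual BERT model name, the mean-pooling strategy, and the classifier head are illustrative assumptions, not the authors' MICE architecture.

```python
# Hypothetical sketch, not the authors' MICE implementation: pool the
# contextual (BERT) token embeddings covering a target expression and
# classify the pooled vector as idiomatic or literal.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

def expression_embedding(sentence: str, expression: str) -> torch.Tensor:
    """Mean-pool the contextual token vectors that cover the target expression."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]            # character spans per sub-word token
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    start = sentence.lower().find(expression.lower())
    end = start + len(expression)
    # keep sub-word tokens whose character span overlaps the expression
    mask = [(s < end and e > start and e > s) for s, e in offsets.tolist()]
    return hidden[torch.tensor(mask)].mean(dim=0)

# Simple binary classification head (idiomatic vs. literal); in practice it
# would be trained with cross-entropy on a labeled dataset such as the one
# introduced in the paper.
classifier = torch.nn.Sequential(
    torch.nn.Linear(encoder.config.hidden_size, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 2),
)

emb = expression_embedding("He finally kicked the bucket last night.", "kicked the bucket")
logits = classifier(emb)
print(logits.softmax(dim=-1))  # untrained head, so the probabilities are arbitrary
```

Because the classifier operates on contextual representations rather than on the surface form of the expression, a trained model of this kind can in principle score expressions it never saw during training, which is the generalization property the abstract highlights.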