Paper title
To Be or Not To Be a Verbal Multiword Expression: A Quest for Discriminating Features
Paper authors
Abstract
Automatic identification of multiword expressions (MWEs) is a prerequisite for semantically-oriented downstream applications. This task is challenging because MWEs, especially verbal ones (VMWEs), exhibit surface variability. However, this variability is usually more restricted than in regular (non-VMWE) constructions, which leads to various variability profiles. We use this fact to determine the optimal set of features which could be used in a supervised classification setting to solve a subproblem of VMWE identification: the identification of occurrences of previously seen VMWEs. Surprisingly, a simple custom frequency-based feature selection method proves more efficient than other standard methods such as the chi-squared test, information gain or decision trees. An SVM classifier using the optimal set of only 6 features outperforms the best systems from a recent shared task on the French seen data.
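To make the feature-selection setting concrete, the following minimal sketch ranks binary features of candidate occurrences by the chi-squared statistic, one of the standard baselines named in the abstract. It is not the authors' implementation and it does not reproduce their custom frequency-based method. The feature names (`contiguous`, `same_order`, `noun_plural`) and the toy data are hypothetical, chosen only to illustrate the idea that a restricted variability profile can separate true VMWE occurrences from coincidental co-occurrences of the same lemmas.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Chi-squared statistic of a 2x2 table (binary feature vs. binary label)."""
    counts = Counter(zip(feature_values, labels))
    a = counts[(1, 1)]  # feature on,  true VMWE occurrence
    b = counts[(1, 0)]  # feature on,  coincidental co-occurrence
    c = counts[(0, 1)]  # feature off, true VMWE occurrence
    d = counts[(0, 0)]  # feature off, coincidental co-occurrence
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A constant feature or a single-class sample carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_features(candidates, labels):
    """Return feature names sorted by decreasing chi-squared score, plus the scores."""
    names = sorted(candidates[0])
    scores = {
        name: chi_squared([cand[name] for cand in candidates], labels)
        for name in names
    }
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked, scores


# Hypothetical toy data: each candidate is a co-occurrence of the lemmas of a
# previously seen VMWE; label 1 marks a true VMWE occurrence, 0 a coincidence.
candidates = [
    {"contiguous": 1, "same_order": 1, "noun_plural": 0},
    {"contiguous": 1, "same_order": 1, "noun_plural": 1},
    {"contiguous": 1, "same_order": 1, "noun_plural": 0},
    {"contiguous": 0, "same_order": 0, "noun_plural": 1},
    {"contiguous": 0, "same_order": 0, "noun_plural": 0},
    {"contiguous": 0, "same_order": 1, "noun_plural": 1},
    {"contiguous": 0, "same_order": 0, "noun_plural": 0},
    {"contiguous": 0, "same_order": 0, "noun_plural": 1},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ranked, scores = rank_features(candidates, labels)
for name in ranked:
    print(f"{name}: {scores[name]:.2f}")
```

In a full pipeline, the top-ranked features would then be passed to a supervised classifier such as an SVM; that step is omitted here to keep the sketch dependency-free.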