Paper title
The distribution of syntactic dependency distances
Paper authors
Paper abstract
The syntactic structure of a sentence can be represented as a graph in which vertices are words and edges indicate syntactic dependencies between them. In this setting, the distance between two linked words is defined as the absolute difference between their positions. We aim to contribute to the characterization of the actual distribution of syntactic dependency distances, which has previously been argued to follow a power law. We propose a new model with two exponential regimes, in which the probability decay is allowed to change after a break-point. This transition could mirror the transition from the processing of word chunks to the processing of higher-level structures. We find that a two-regime model, in which the first regime follows either an exponential or a power-law decay, is the most likely one in all 20 languages we considered, independently of sentence length and annotation style. Moreover, the break-point exhibits low variation across languages and averages 4-5 words, suggesting that the number of words that can be processed simultaneously is largely independent of the specific language. The probability decay slows down after the break-point, consistent with a universal chunk-and-pass mechanism. Finally, we give an account of the relation between the best estimated model and the closeness of syntactic dependencies as a function of sentence length, according to a recently introduced optimality score.
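To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It computes dependency distances from a list of head positions, then builds one possible two-regime distribution whose geometric (discrete exponential) decay rate changes after a break-point. The function names, the parameters q1, q2, d_star and d_max, and the exact parametrization (continuous at the break-point, truncated at d_max) are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def dependency_distances(heads):
    """Return the distance |head - dependent| for every non-root word.

    heads[i] is the 1-based position of the head of word i + 1;
    0 marks the root, which has no incoming dependency.
    """
    return [abs(h - (i + 1)) for i, h in enumerate(heads) if h != 0]


def two_regime_pmf(q1, q2, d_star, d_max):
    """Hypothetical two-regime geometric pmf on distances 1..d_max.

    Assumed (illustrative) form of the unnormalized weight:
      (1 - q1)^(d - 1)                               for d <= d_star
      (1 - q1)^(d_star - 1) * (1 - q2)^(d - d_star)  for d >  d_star
    so the decay rate switches from q1 to q2 after the break-point d_star.
    """
    weights = []
    for d in range(1, d_max + 1):
        if d <= d_star:
            w = (1 - q1) ** (d - 1)
        else:
            w = (1 - q1) ** (d_star - 1) * (1 - q2) ** (d - d_star)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(distances, pmf):
    """Log-likelihood of observed distances under pmf (pmf[d - 1] = P(d))."""
    return sum(math.log(pmf[d - 1]) for d in distances)


# "She loves her dog": She -> loves, loves = root, her -> dog, dog -> loves
heads = [2, 0, 4, 2]
distances = dependency_distances(heads)  # [1, 1, 2]

# q2 < q1 encodes a decay that slows down after the break-point.
pmf = two_regime_pmf(q1=0.5, q2=0.1, d_star=4, d_max=19)
print(distances, log_likelihood(distances, pmf))
```

Under these assumptions, comparing log-likelihoods (with a penalty for the number of parameters) across candidate models and break-points is one way a most likely model could be selected; the selection criterion the authors actually used is not stated in the abstract.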