隐性模型，潜在压缩，内在偏见和廉价午餐

论文标题

隐性模型，潜在压缩，内在偏见和廉价午餐

Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection

论文作者

Peixoto, Tiago P., Kirkley, Alec

论文摘要

旨在将网络划分为节点群以总结其大规模结构的旨在将网络划分为大规模结构的任务，它催生了许多具有不同目标的竞争算法的开发。某些社区检测方法是推理的，通过概率生成模型明确地得出了聚类目标，而其他方法则是描述性的，根据特定应用程序动机的目标对网络进行划分，这使得以相同规模比较这些方法具有挑战性。在这里，我们提出了一个解决这个问题的解决方案，该解决方案将任何社区检测目标（推断或描述性）与相应的隐式网络生成模型相关联。这使我们能够计算网络及其在任意目标下的分区的描述长度，从而提供了一种原则性的措施来比较不同算法的性能，而无需“地面真相”标签。我们的方法还可以访问对任何给定算法最佳的社区检测问题的实例，并以这种方式揭示了流行描述性方法的内在偏见，从而解释了它们过度合适的趋势。使用我们的框架，我们比较了人造网络上的许多社区检测方法，以及在500多个结构上多样化的经验网络上进行了比较。我们发现，更具表现力的社区检测方法在结构化数据实例上表现出始终如一的出色压缩性能，而没有在更专业的算法最佳性能的少数情况下降低性能。我们的结果破坏了“无免费午餐”定理对社区检测的含义，无论是在概念上还是在实践上，因为它仅限于非结构化的数据实例，这与相关的社区检测问题不同，这些问题是根据要求构成的。

The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.

下载PDF全文

下载文献需遵守相关版权规定

论文标题