论文标题
使用随机块模型为大规模网络的分布式社区检测
Distributed Community Detection for Large Scale Networks Using Stochastic Block Model
论文作者
论文摘要
随着信息和技术的快速发展,大规模网络数据无处不在。在这项工作中,我们开发了一种分布式的光谱聚类算法,用于大规模网络中的社区检测。为了解决问题,我们在主服务器上分发L Pilot网络节点,而其他工人服务器上的其他则分发。首先在主上进行频谱聚类算法,以选择伪中心。然后将伪中心的索引广播给工人,以使用SVD类型算法完成分布式社区检测任务。拟议的分布式算法具有三个优点。首先,沟通成本很低,因为仅传达了伪中心的索引。其次,在工人上不需要进一步的迭代算法,因此它不会因初始化和非态度而遭受问题。第三,与使用整个邻接矩阵相比,计算复杂性和存储要求都要低得多。开发了Python软件包DCD(www.github.com/ikerlz/dcd),以实现Spark系统的分布式算法。根据估计准确性和错误群集率提供了理论特性。最后,通过各种合成和经验数据集的实验说明了所提出的方法的优势。
With rapid developments of information and technology, large scale network data are ubiquitous. In this work we develop a distributed spectral clustering algorithm for community detection in large scale networks. To handle the problem, we distribute l pilot network nodes on the master server and the others on worker servers. A spectral clustering algorithm is first conducted on the master to select pseudo centers. The indexes of the pseudo centers are then broadcasted to workers to complete distributed community detection task using a SVD type algorithm. The proposed distributed algorithm has three merits. First, the communication cost is low since only the indexes of pseudo centers are communicated. Second, no further iteration algorithm is needed on workers and hence it does not suffer from problems as initialization and non-robustness. Third, both the computational complexity and the storage requirements are much lower compared to using the whole adjacency matrix. A Python package DCD (www.github.com/Ikerlz/dcd) is developed to implement the distributed algorithm for a Spark system. Theoretical properties are provided with respect to the estimation accuracy and mis-clustering rates. Lastly, the advantages of the proposed methodology are illustrated by experiments on a variety of synthetic and empirical datasets.