论文标题
提高K-均值算法的性能
Improving The Performance Of The K-means Algorithm
论文作者
论文摘要
引入了增量K-均值(IKM),这是一个改进的K-均值(KM)的版本,以显着提高KM的聚类质量。但是,IKM的速度比公里慢。我的论文提出了两种算法,以加快IKM的速度,同时大约保持其聚类结果的质量。第一种称为分裂K-均值的算法通过加快其簇的分裂过程来提高IKM的速度。使用UCI机器学习数据集测试,新算法可以实现经验上的全球最佳效果,并且具有较低的复杂性,$ O(k*log_ {2} k*n)$,而不是IKM,$ O(k^{2} n)$。第二种算法称为平行的两相K-均值(PAR2PK均值),通过使用两相K-均值的模型来平行IKM。通过大型数据集进行测试,该算法达到了良好的加速比,结束了线性加速比。
The Incremental K-means (IKM), an improved version of K-means (KM), was introduced to improve the clustering quality of KM significantly. However, the speed of IKM is slower than KM. My thesis proposes two algorithms to speed up IKM while remaining the quality of its clustering result approximately. The first algorithm, called Divisive K-means, improves the speed of IKM by speeding up its splitting process of clusters. Testing with UCI Machine Learning data sets, the new algorithm achieves the empirically global optimum as IKM and has lower complexity, $O(k*log_{2}k*n)$, than IKM, $O(k^{2}n)$. The second algorithm, called Parallel Two-Phase K-means (Par2PK-means), parallelizes IKM by employing the model of Two-Phase K-means. Testing with large data sets, this algorithm attains a good speedup ratio, closing to the linearly speed-up ratio.