Consistent Model-based Clustering: using the Quasi-Bernoulli Stick-Breaking Process

Paper title

Consistent Model-based Clustering: using the Quasi-Bernoulli Stick-Breaking Process

Authors

Zeng, Cheng; Miller, Jeffrey W.; Duan, Leo L.

Abstract

In mixture modeling and clustering applications, the number of components and clusters is often not known. A stick-breaking mixture model, such as the Dirichlet process mixture model, is an appealing construction that assumes infinitely many components, while shrinking the weights of most of the unused components to near zero. However, it is well-known that this shrinkage is inadequate: even when the component distribution is correctly specified, spurious weights appear and give an inconsistent estimate of the number of clusters. In this article, we propose a simple solution: when breaking each mixture weight stick into two pieces, the length of the second piece is multiplied by a quasi-Bernoulli random variable, taking value one or a small constant close to zero. This effectively creates a soft-truncation and further shrinks the unused weights. Asymptotically, we show that as long as this small constant diminishes to zero at a rate faster than $o(1/n^2)$, with $n$ the sample size, the posterior distribution will converge to the true number of clusters. In comparison, we rigorously explore Dirichlet process mixture models using a concentration parameter that is either constant or rapidly diminishes to zero -- both of which lead to inconsistency for the number of clusters. Our proposed model is easy to implement, requiring only a small modification of a standard Gibbs sampler for mixture models. In simulations and a data application of clustering brain networks, our proposed method recovers the ground-truth number of clusters, and leads to a small number of clusters.
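
The stick-breaking step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact parameterization, so the following are assumptions made only for illustration: the second piece of each stick is drawn as Beta(alpha, 1) (which matches the usual Dirichlet process construction when the multiplier is always one), the quasi-Bernoulli variable equals one with probability p and eps otherwise, and the function name and default values are hypothetical.

```python
import random


def quasi_bernoulli_stick_breaking(num_sticks, alpha=1.0, p=0.9, eps=1e-4, seed=None):
    """Draw mixture weights by stick-breaking with a quasi-Bernoulli multiplier.

    Assumed parameterization (not stated in the abstract):
      beta_k ~ Beta(alpha, 1) is the fraction kept as the second piece,
      b_k is 1 with probability p and eps otherwise.
    """
    rng = random.Random(seed)
    weights, multipliers = [], []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for _ in range(num_sticks):
        beta = rng.betavariate(alpha, 1.0)
        b = 1.0 if rng.random() < p else eps  # quasi-Bernoulli: one or a small constant
        second_piece = b * beta               # shrunk towards zero when b == eps
        weights.append(remaining * (1.0 - second_piece))  # first piece becomes a weight
        multipliers.append(b)
        remaining *= second_piece             # only the second piece is broken further
    return weights, multipliers, remaining


if __name__ == "__main__":
    w, b, leftover = quasi_bernoulli_stick_breaking(10, seed=0)
    print([round(x, 6) for x in w])
    print("unassigned mass:", leftover)
```

Whenever the multiplier takes the value eps, almost all of the remaining stick goes to the current component and every later weight is scaled by at most eps, which is the soft-truncation effect the abstract refers to. Setting p = 1 in this sketch gives ordinary Dirichlet-process-style stick-breaking.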
