A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data

Paper title

A parsimonious family of multivariate Poisson-lognormal distributions for clustering multivariate count data

Authors

Sanjeena Subedi, Ryan Browne

Abstract

Multivariate count data are commonly encountered through high-throughput sequencing technologies in bioinformatics, in text mining, and in sports analytics. Although the Poisson distribution seems a natural fit to these count data, its multivariate extension is computationally expensive. In most cases, mutual independence among the variables is assumed; however, this fails to take into account the correlation among the variables usually observed in the data. Recently, mixtures of multivariate Poisson-lognormal (MPLN) models have been used to analyze such multivariate count measurements with a dependence structure. In the MPLN model, each count is modeled using an independent Poisson distribution conditional on a latent multivariate Gaussian variable. Due to this hierarchical structure, the MPLN model can account for over-dispersion, unlike the traditional Poisson distribution, and allows for correlation between the variables. Rather than relying on a Monte Carlo-based estimation framework, which is computationally inefficient, a fast variational-EM-based framework is used here for parameter estimation. Further, a parsimonious family of mixtures of Poisson-lognormal distributions is proposed by decomposing the covariance matrix and imposing constraints on these decompositions. The utility of such models is shown using simulated and benchmark datasets.
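
The hierarchical structure described in the abstract can be illustrated with a small simulation. The following is a minimal sketch of a single MPLN component only, not the authors' implementation: it does not cover the mixture, the variational-EM estimation, or the constrained covariance decompositions. The function name `sample_mpln` and the values of `mu` and `sigma` are assumptions chosen for illustration.

```python
import numpy as np

def sample_mpln(n, mu, sigma, seed=0):
    """Draw n count vectors from one multivariate Poisson-lognormal component."""
    rng = np.random.default_rng(seed)
    # Latent layer: a multivariate Gaussian vector on the log scale.
    latent = rng.multivariate_normal(mu, sigma, size=n)
    # Observation layer: counts are independent Poissons given the latent vector.
    return rng.poisson(np.exp(latent))

# Hypothetical parameters for a two-variable example.
mu = np.array([1.0, 1.5])
sigma = np.array([[0.5, 0.3],
                  [0.3, 0.5]])

counts = sample_mpln(20000, mu, sigma)

means = counts.mean(axis=0)
variances = counts.var(axis=0)
corr = np.corrcoef(counts.T)[0, 1]

print("means:", means)
print("variances:", variances)
print("correlation:", corr)
```

For a plain Poisson variable the variance equals the mean. Here the sample variances should exceed the sample means (over-dispersion), and the positive off-diagonal entry of `sigma` should produce a positive correlation between the two count variables.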
