Paper Title

Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs

Paper Authors

Jonas Rieger, Lars Koppers, Carsten Jentsch, Jörg Rahnenführer

Paper Abstract

For organizing large text corpora, topic modeling provides useful tools. A widely used method is Latent Dirichlet Allocation (LDA), a generative probabilistic model which models single texts in a collection of texts as mixtures of latent topics. The assignments of words to topics rely on initial values such that generally the outcome of LDA is not fully reproducible. In addition, the reassignment via Gibbs Sampling is based on conditional distributions, leading to different results in replicated runs on the same text data. This fact is often neglected in everyday practice. We aim to improve the reliability of LDA results. Therefore, we study the stability of LDA by comparing assignments from replicated runs. We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient. Using such similarities, topics can be clustered. A new pruning algorithm for hierarchical clustering results, based on the idea that two LDA runs create pairs of similar topics, is proposed. This approach leads to the new measure S-CLOP ({\bf S}imilarity of multiple sets by {\bf C}lustering with {\bf LO}cal {\bf P}runing) for quantifying the stability of LDA models. We discuss some characteristics of this measure and illustrate it with an application to real data consisting of newspaper articles from \textit{USA Today}. Our results show that the measure S-CLOP is useful for assessing the stability of LDA models or of any other topic modeling procedure that characterizes its topics by word distributions. Based on the newly proposed measure for LDA stability, we propose a method to increase the reliability and hence to improve the reproducibility of empirical findings based on topic modeling. This increase in reliability is obtained by running the LDA several times and taking as prototype the most representative run, that is, the LDA run with the highest average similarity to all other runs.
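
To make the prototype idea from the abstract concrete, below is a minimal Python sketch, not the authors' implementation: it replaces the paper's modified Jaccard coefficient with a plain Jaccard overlap of each topic's top_n most probable words, and replaces the hierarchical clustering with local pruning by a simple greedy one-to-one matching of topics. The function names and the top_n parameter are illustrative assumptions.

import numpy as np

def top_word_jaccard(topic_a, topic_b, top_n=20):
    # Simplified stand-in for the paper's modified Jaccard coefficient:
    # Jaccard overlap of the top_n most probable words of two
    # topic-word distributions (1-D arrays over the vocabulary).
    top_a = set(np.argsort(topic_a)[-top_n:])
    top_b = set(np.argsort(topic_b)[-top_n:])
    return len(top_a & top_b) / len(top_a | top_b)

def run_similarity(run_a, run_b, top_n=20):
    # Greedy one-to-one matching of topics across two replicated runs
    # (a crude proxy for the paper's clustering with local pruning),
    # averaged over the matched topic pairs.
    sims = np.array([[top_word_jaccard(ta, tb, top_n) for tb in run_b]
                     for ta in run_a])
    used_a, used_b, matched = set(), set(), []
    for flat in np.argsort(-sims, axis=None):
        a, b = np.unravel_index(flat, sims.shape)
        if a not in used_a and b not in used_b:
            matched.append(sims[a, b])
            used_a.add(a)
            used_b.add(b)
    return float(np.mean(matched))

def select_prototype(runs, top_n=20):
    # Prototype = the replicated run with the highest mean similarity
    # to all other runs.
    n = len(runs)
    avg = [np.mean([run_similarity(runs[i], runs[j], top_n)
                    for j in range(n) if j != i]) for i in range(n)]
    return int(np.argmax(avg)), avg

# Usage (hypothetical): `runs` is a list of K x V topic-word matrices,
# one per replicated LDA fit.
# prototype_index, avg_similarities = select_prototype(runs)

Averaging the matched-pair similarities mirrors the intuition behind S-CLOP (stable runs produce topics that pair up well across replications), but the exact measure defined in the paper uses a modified, weighted Jaccard coefficient and a dedicated pruning of the hierarchical clustering rather than the greedy matching used here.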
