Paper title
SeNMFk-SPLIT: Large Corpora Topic Modeling by Semantic Non-negative Matrix Factorization with Automatic Model Selection
Paper authors
Paper abstract
As the amount of text data continues to grow, topic modeling is serving an important role in understanding the content hidden by the overwhelming quantity of documents. One popular topic modeling approach is non-negative matrix factorization (NMF), an unsupervised machine learning (ML) method. Recently, Semantic NMF with automatic model selection (SeNMFk) has been proposed as a modification to NMF. In addition to heuristically estimating the number of topics, SeNMFk also incorporates the semantic structure of the text. This is performed by jointly factorizing the term frequency-inverse document frequency (TF-IDF) matrix with the co-occurrence/word-context matrix, the values of which represent the number of times two words co-occur in a predetermined window of the text. In this paper, we introduce a novel distributed method, SeNMFk-SPLIT, for semantic topic extraction suitable for large corpora. Contrary to SeNMFk, our method enables the joint factorization of large documents by decomposing the word-context and term-document matrices separately. We demonstrate the capability of SeNMFk-SPLIT by applying it to the entire artificial intelligence (AI) and ML scientific literature uploaded on arXiv.
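The two inputs named in the abstract, a TF-IDF matrix and a windowed word co-occurrence matrix, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cooccurrence_counts` and `tfidf`, the window size of 2, the toy documents, and the specific TF-IDF variant (length-normalized term frequency times unsmoothed `log(N / df)`) are all assumptions chosen for illustration, and the factorization step itself is not shown.

```python
from collections import Counter
import math


def cooccurrence_counts(tokens, window=2):
    """Count how often two distinct words appear within `window` positions of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look only ahead so each pair of positions is visited once;
        # store the pair sorted so (a, b) and (b, a) share one key.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


def tfidf(docs):
    """Return one {term: weight} dict per document (assumed variant: tf * log(N / df))."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights


# Hypothetical toy corpus, already tokenized.
docs = [
    "topic models factorize matrices".split(),
    "matrices encode word context".split(),
]
cooc = cooccurrence_counts(docs[0], window=2)
weights = tfidf(docs)
```

In this sketch, words that are farther apart than the window never form a pair, and a term that occurs in every document receives a TF-IDF weight of zero. The method described in the abstract factorizes matrices built from quantities of this kind; how those matrices are split and distributed is specific to the paper and is not reproduced here.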