Paper Title

Bag of biterms modeling for short texts

Authors

Anh Phan Tuan, Bach Tran, Thien Huu Nguyen, Linh Ngo Van, Khoat Than

Abstract

Analyzing texts from social media encounters many challenges due to their unique characteristics of shortness, massiveness, and dynamism. Short texts do not provide enough contextual information, causing the failure of traditional statistical models. Furthermore, many applications often face massive and dynamic short texts, which pose various computational challenges to current batch learning algorithms. This paper presents a novel framework, namely Bag of Biterms Modeling (BBM), for modeling massive, dynamic, and short text collections. BBM comprises two main ingredients: (1) the concept of Bag of Biterms (BoB) for representing documents, and (2) a simple way to help statistical models incorporate BoB. Our framework can be easily deployed for a large class of probabilistic models, and we demonstrate its usefulness with two well-known models: Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP). By exploiting both terms (words) and biterms (pairs of words), BBM offers two major advantages: (1) it enhances the length of documents and makes the context more coherent by emphasizing word connotation and co-occurrence via the Bag of Biterms, and (2) it inherits the inference and learning algorithms of the primitive models, making it straightforward to design online and streaming algorithms for short texts. Extensive experiments suggest that BBM outperforms several state-of-the-art models. We also point out that the BoB representation performs better than traditional representations (e.g., Bag of Words, tf-idf) even for normal texts.
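
To make the idea of representing a document by both terms and biterms more concrete, here is a minimal sketch (not the authors' implementation): it assumes biterms are all unordered pairs of words co-occurring within the whole short document, and the function name `bag_of_biterms` is illustrative only.

```python
from itertools import combinations

def bag_of_biterms(tokens):
    """Sketch of a Bag-of-Biterms style representation for a short document.

    Assumption: the co-occurrence window is the whole document, so the
    representation contains every term plus every unordered word pair
    (biterm), which lengthens the document and makes co-occurrence explicit.
    """
    terms = list(tokens)
    # All unordered co-occurring word pairs within the document.
    biterms = [tuple(sorted(pair)) for pair in combinations(tokens, 2)]
    return terms + biterms

# Toy short text.
doc = ["topic", "model", "short", "text"]
print(bag_of_biterms(doc))
# ['topic', 'model', 'short', 'text',
#  ('model', 'topic'), ('short', 'topic'), ('text', 'topic'),
#  ('model', 'short'), ('model', 'text'), ('short', 'text')]
```

Under this reading, a statistical model such as LDA or HDP can consume the expanded bag in place of the plain Bag of Words, which is why existing inference and learning algorithms carry over largely unchanged.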
