Paper Title

Arabic Text Mining

Authors

AL-Ghuribi, Sumaia Mohammed; Noah, Shahrul Azman Mohd

Abstract

The rapid growth of the internet has increased the number of online texts, including a rapidly growing number of texts in the Arabic language. This enormous amount of text must be organized into classes to make analysis and text retrieval easier. Text classification is, therefore, a key component of text mining. There are numerous systems and approaches for categorizing literature in English, other European languages (French, German, Spanish), and Asian languages (Chinese, Japanese). In contrast, there are relatively few studies on categorizing Arabic literature, owing to the difficulty of the Arabic language. In this work, a brief explanation of key ideas relevant to Arabic text mining is introduced, and then a new classification system for the Arabic language is presented, using light stemming and Classifier Naïve Bayesian (CNB). Our corpus includes texts from two classes, politics and sports. When new texts were added to the system, it classified them correctly, demonstrating the effectiveness of the system.
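The pipeline named in the abstract (light stemming followed by a Naïve Bayes classifier over two classes) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the affix lists are a simplified selection in the spirit of common Arabic light stemmers, the classifier is a plain multinomial Naïve Bayes with add-one smoothing, and the four training sentences are a hypothetical toy corpus.

```python
import math
from collections import Counter

# Assumed affix lists (simplified; the abstract does not give the real ones).
# Longer prefixes come first so that "وال" is tried before "ال" and "و".
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 3  # assumed threshold: never shorten a word below 3 letters


def light_stem(word):
    """Strip at most one prefix and one suffix, without any root extraction."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def tokenize(text):
    """Split on whitespace and light-stem every token."""
    return [light_stem(token) for token in text.split()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_docs}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        # Stems never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.class_docs[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus: two politics sentences and two sports sentences.
train_docs = [
    "الحكومة تعلن قرارات جديدة",
    "البرلمان يناقش قانون الانتخابات",
    "الفريق يفوز في المباراة",
    "اللاعب يسجل هدفا في المباراة",
]
train_labels = ["politics", "politics", "sports", "sports"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("فاز الفريق في مباراة"))
print(model.predict("الحكومة تناقش قانون جديد"))
```

Light stemming is what lets the definite form "المباراة" in the training data and the indefinite form "مباراة" in a new text map to the same feature, which matters when the corpus is small. A real system would also need orthographic normalization and a far larger corpus than this toy example.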
