Paper Title
Potrika: Raw and Balanced Newspaper Datasets in the Bangla Language with Eight Topics and Five Attributes
Paper Authors
Paper Abstract
Knowledge is central to human and scientific development. Natural Language Processing (NLP) allows automated analysis and creation of knowledge. Data is a crucial ingredient of NLP and machine learning. The scarcity of open datasets is a well-known problem in machine and deep learning research. This is very much the case for textual NLP datasets in English and other major world languages. For the Bangla language, the situation is even more challenging, and the number of large datasets for NLP research is practically nil. We hereby present Potrika, a large single-label dataset of Bangla news article text curated for NLP research from six popular online news portals in Bangladesh (Jugantor, Jaijaidin, Ittefaq, Kaler Kontho, Inqilab, and Somoyer Alo) for the period 2014-2020. The articles are classified into eight distinct categories (National, Sports, International, Entertainment, Economy, Education, Politics, and Science & Technology), and each article carries five attributes (News Article, Category, Headline, Publication Date, and Newspaper Source). The raw dataset contains 185.51 million words and 12.57 million sentences across 664,880 news articles. Moreover, using NLP augmentation techniques, we create from the raw (unbalanced) dataset a second (balanced) dataset comprising 320,000 news articles, with 40,000 articles in each of the eight news categories. Potrika includes both datasets (raw and balanced) to suit a wide range of NLP research. To the best of our knowledge, Potrika is by far the largest and most extensive dataset for news classification.