Paper Title
The POLUSA Dataset: 0.9M Political News Articles Balanced by Time and Outlet Popularity
Authors
Abstract
News articles covering policy issues are an essential source of information in the social sciences and are also frequently used for other purposes, e.g., to train NLP language models. Deriving meaningful insights from the analysis of news requires large datasets that represent real-world distributions, e.g., with respect to the popularity of the included outlets, topic coverage, or time. Information on the political leanings of media publishers is often needed as well, e.g., to study differences in news reporting across the political spectrum, which is one of the prime use cases in the social sciences when studying media bias and related societal issues. Existing datasets have major flaws with respect to these requirements, so researchers repeatedly spend redundant and cumbersome effort on creating their own datasets. To fill this gap, we present POLUSA, a dataset that represents the online media landscape as perceived by an average US news consumer. The dataset contains 0.9M articles covering policy topics, published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum. Each outlet is labeled with its political leaning, which we derive using a systematic aggregation of eight data sources. The dataset is balanced with respect to publication date and outlet popularity. POLUSA enables the study of a variety of subjects, e.g., media effects and political partisanship. Due to its size, the dataset also supports the use of data-intensive deep learning methods.
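The abstract names two construction steps, aggregating several leaning ratings per outlet and balancing articles by publication date and outlet popularity, but does not describe how either is done. The following Python sketch illustrates one plausible reading and is not the authors' procedure. Everything in it is an assumption made for illustration: the numeric rating scale (-2 for left to +2 for right), the median-based aggregation and its thresholds, the per-month quota scheme, and the names `aggregate_leaning`, `balanced_sample`, `popularity`, and `per_month`.

```python
import random
from collections import defaultdict
from statistics import median


def aggregate_leaning(ratings):
    """Combine numeric ratings from several sources into one label.

    Assumed scale: -2 (left) to +2 (right). The median and the
    +/-0.5 thresholds are illustrative choices, not from the paper.
    """
    m = median(ratings)
    if m <= -0.5:
        return "left"
    if m >= 0.5:
        return "right"
    return "center"


def balanced_sample(articles, popularity, per_month, seed=0):
    """Draw roughly `per_month` articles for every month, split across
    outlets in proportion to their popularity weights.

    articles:   list of dicts with keys "outlet" and "month" (e.g. "2017-01")
    popularity: dict mapping outlet name -> positive weight (hypothetical)
    """
    rng = random.Random(seed)

    # Group articles by (month, outlet) so each cell can be sampled separately.
    buckets = defaultdict(list)
    for article in articles:
        buckets[(article["month"], article["outlet"])].append(article)

    total_weight = sum(popularity.values())
    months = sorted({article["month"] for article in articles})

    sample = []
    for month in months:
        for outlet, weight in popularity.items():
            quota = round(per_month * weight / total_weight)
            pool = buckets.get((month, outlet), [])
            # Never request more articles than the cell contains.
            sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

With two hypothetical outlets weighted 3:1 and `per_month=20`, each month contributes 15 articles from the first outlet and 5 from the second, provided enough articles exist in each month. Because each outlet's quota is rounded separately, the monthly total can differ slightly from `per_month` when the weights do not divide evenly.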