Paper Title

Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

Authors

Antonis Papasavva, Savvas Zannettou, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn

Abstract

This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.
