Developing a Multilingual Annotated Corpus of Misogyny and Aggression

Paper Title

Developing a Multilingual Annotated Corpus of Misogyny and Aggression

Authors

Shiladitya Bhattacharya, Siddharth Singh, Ritesh Kumar, Akanksha Bansal, Akash Bhagat, Yogesh Dawer, Bornini Lahiri, Atul Kr. Ojha

Abstract

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
