Paper Title
Multilingual Hate Speech and Offensive Content Detection using Modified Cross-entropy Loss
Paper Authors
Paper Abstract
The increasing number of social media users has led to many people misusing these platforms to spread offensive content and hate speech. Manually tracking the vast number of posts is impractical, so it is necessary to devise automated methods to identify them quickly. Large language models are trained on large amounts of data and also make use of contextual embeddings. We fine-tune large language models to help in our task. The data is also quite imbalanced, so we used a modified cross-entropy loss to tackle this issue. We observed that a model fine-tuned on Hindi corpora performs better. Our team (HNLP) achieved macro F1-scores of 0.808 and 0.639 in English Subtask A and English Subtask B, respectively. For Hindi Subtask A and Hindi Subtask B, our team achieved macro F1-scores of 0.737 and 0.443, respectively, in HASOC 2021.
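As an illustration of how a cross-entropy loss can be modified for imbalanced data, below is a minimal PyTorch sketch of a class-weighted variant. The class counts, weights, and dummy logits are illustrative assumptions; the abstract does not specify the exact modification the authors used.

```python
import torch
import torch.nn as nn

# Hypothetical class counts for a binary hate/offensive detection task
# (numbers are illustrative, not taken from the paper).
class_counts = torch.tensor([3000.0, 700.0])  # [not-offensive, offensive]

# Inverse-frequency class weights: rarer classes receive larger weights.
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Weighted cross-entropy: one common way to modify the loss for
# imbalanced data.
criterion = nn.CrossEntropyLoss(weight=weights)

# Example usage with dummy logits standing in for the output of a
# fine-tuned transformer classification head.
logits = torch.randn(8, 2)          # batch of 8 examples, 2 classes
labels = torch.randint(0, 2, (8,))  # gold labels
loss = criterion(logits, labels)
print(loss.item())
```

With this weighting, misclassifying an example of the minority (offensive) class contributes more to the loss, which counteracts the class imbalance mentioned in the abstract.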