A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Paper Title

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Authors

Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, John P. McCrae

Abstract

There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data because of the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific to this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets are available, and only for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English, annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff's alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts.

