Paper title
L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models
Paper authors
Paper abstract
Code-switching occurs when more than one language is mixed within a single sentence or conversation. The phenomenon is especially prominent on social media platforms, and its use is increasing over time. Code-mixed NLP has therefore been studied extensively in the literature. As pre-trained transformer-based architectures gain popularity, we observe that real code-mixed data for pre-training large language models remains scarce. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script. It consists of 52.93M sentences and 1.04B tokens scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models are pre-trained on the code-mixed HingCorpus using the masked language modelling objective. We show the effectiveness of these BERT models on downstream tasks such as code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. HingGPT is a GPT-2 based generative transformer model capable of generating full tweets. We also release the L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification (LID) dataset, and HingBERT-LID, a production-quality LID model that facilitates capturing more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp.
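The masked language modelling objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the masking rate or replacement scheme used for the Hing* models, so the 15% selection rate and the 80/10/10 split below are assumptions borrowed from the original BERT recipe. The function name `mask_tokens`, the whitespace tokenisation, and the example sentence are all hypothetical and are not taken from HingCorpus or the released code.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol, following the original BERT convention


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for masked language modelling.

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so a model would be scored only on selected positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels


if __name__ == "__main__":
    # Invented Roman-script Hindi-English sentence, for illustration only.
    sentence = "yeh movie bahut achhi thi but the ending was sad".split()
    vocab = sorted(set(sentence))
    corrupted, labels = mask_tokens(sentence, vocab, mask_prob=0.3, seed=1)
    print(corrupted)
    print(labels)
```

During pre-training, the model receives the corrupted sequence and is trained to recover the original token at each position where the label is not None. A real pipeline would operate on subword IDs from the model's tokenizer rather than whitespace-split words.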