L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Paper Title

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Authors

Ravindra Nayak, Raviraj Joshi

Abstract

Code-switching occurs when more than one language is mixed within a single sentence or conversation. The phenomenon is especially prominent on social media platforms, and its adoption is increasing over time. Code-mixed NLP has therefore been studied extensively in the literature. As pre-trained transformer-based architectures gain popularity, we observe that real code-mixed data for pre-training large language models are scarce. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script. It consists of 52.93M sentences and 1.04B tokens scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on downstream tasks such as code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. HingGPT is a GPT-2-based generative transformer model capable of generating full tweets. We also release the L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification (LID) dataset, and HingBERT-LID, a production-quality LID model that facilitates capturing more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp.
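The masked language modelling objective mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the masking rate or replacement scheme, so the code below assumes the recipe from the original BERT paper (select about 15% of positions; of those, 80% become a mask symbol, 10% a random token, 10% stay unchanged). The function name `mask_tokens`, the example tweet, and the whitespace tokenization are hypothetical and chosen only for illustration; real pre-training would use a subword tokenizer and a training framework.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. The 80/10/10 split is an assumption taken from
    the original BERT recipe, not from the HingBERT abstract.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

# Hypothetical romanized Hindi-English tweet, whitespace-tokenized for illustration.
tweet = "yaar ye movie bahut achhi thi totally worth it".split()
vocab = sorted(set(tweet))
corrupted, labels = mask_tokens(tweet, vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

During pre-training, the model sees `corrupted` as input and is trained to recover the original token at every position where `labels` is not `None`.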
