Paper Title

BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla

Authors

Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Rifat Shahriyar

Abstract

This work presents BanglaNLG, a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain BanglaT5, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9% absolute gain and 32% relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.
