Paper Title

Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages

Authors

Jivnesh Sandhan, Ayush Daksh, Om Adideva Paranjay, Laxmidhar Behera, Pawan Goyal

Abstract

Interest in code-mixing has become ubiquitous in Natural Language Processing (NLP); however, little attention has been given to this phenomenon in the speech translation (ST) task. This can be solely attributed to the lack of labelled data for code-mixed ST. We therefore introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages. It is multi-domain, covers ten language families, and contains 94 hours of speech by 130+ speakers, manually aligned with the corresponding text in the target language. Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching when quoting from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadavani is the first multilingual code-mixed ST dataset available in the ST literature. The data can also be used for the code-mixed machine translation task. The dataset can be accessed at https://github.com/frozentoad9/CMST.
