Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages

Paper Title

Prabhupadavani: A Code-mixed Speech Translation Data for 25 Languages

Authors

Jivnesh Sandhan, Ayush Daksh, Om Adideva Paranjay, Laxmidhar Behera, Pawan Goyal

Abstract

Interest in code-mixing has become ubiquitous in Natural Language Processing (NLP); however, little attention has been given to this phenomenon in the Speech Translation (ST) task. This can be solely attributed to the lack of labelled data for the code-mixed ST task. Thus, we introduce Prabhupadavani, a multilingual code-mixed ST dataset for 25 languages. It is multi-domain and covers ten language families, containing 94 hours of speech by 130+ speakers, manually aligned with corresponding text in the target language. Prabhupadavani is about Vedic culture and heritage from Indic literature, where code-switching when quoting from literature is important in the context of humanities teaching. To the best of our knowledge, Prabhupadavani is the first multilingual code-mixed ST dataset available in the ST literature. The data can also be used for the code-mixed machine translation task. The full dataset can be accessed at https://github.com/frozentoad9/CMST.
