Paper Title

Data and knowledge-driven approaches for multilingual training to improve the performance of speech recognition systems of Indian languages

Paper Authors

Madhavaraj, A., Ganesan, Ramakrishnan Angarai

Paper Abstract

We propose data- and knowledge-driven approaches for multilingual training of the automatic speech recognition (ASR) system for a target language by pooling speech data from multiple source languages. Exploiting the acoustic similarities between Indian languages, we implement two approaches. In phone/senone mapping, a deep neural network (DNN) learns to map the senones or phones of one language to those of another, and the transcriptions of the source languages are modified so that they can be used along with the target-language data to train and fine-tune the target-language ASR system. In the other approach, we model the acoustic information for all the languages simultaneously by training a multitask DNN (MTDNN) to predict the senones of each language in a separate output layer. The cross-entropy loss and the weight-update procedure are modified such that, when a feature vector belongs to a particular language, only the shared layers and the output layer responsible for predicting that language's senone classes are updated during training. In the low-resource setting (LRS), 40 hours of transcribed data each for Tamil, Telugu and Gujarati are used for training. The DNN-based senone mapping technique gives relative improvements in word error rate (WER) of 9.66%, 7.2% and 15.21% over the baseline systems for Tamil, Gujarati and Telugu, respectively. In the medium-resource setting (MRS), 160, 275 and 135 hours of data for Tamil, Kannada and Hindi are used, and the same technique gives larger relative improvements of 13.94%, 10.28% and 27.24% for Tamil, Kannada and Hindi, respectively. The MTDNN with senone-mapping-based training gives higher relative WER improvements of 15.0%, 17.54% and 16.06% for Tamil, Gujarati and Telugu in the LRS, and improvements of 21.24%, 21.05% and 30.17% for Tamil, Kannada and Hindi in the MRS.
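
To make the multitask training scheme described above concrete, here is a minimal PyTorch sketch of an MTDNN with shared hidden layers and one senone output layer per language, in which each mini-batch updates only the shared layers and the output layer of the language it comes from. This is an illustration under assumptions, not the authors' implementation: the layer sizes, feature dimension, hypothetical senone counts and optimiser settings are placeholders.

import torch
import torch.nn as nn

class MultiTaskDNN(nn.Module):
    """Shared hidden layers with one senone output layer per language."""
    def __init__(self, feat_dim, hidden_dim, senones_per_lang):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One senone classifier per language, e.g. {"ta": 3000, "gu": 2800, "te": 3200}
        self.heads = nn.ModuleDict(
            {lang: nn.Linear(hidden_dim, n) for lang, n in senones_per_lang.items()}
        )

    def forward(self, feats, lang):
        # Only the head of the language that owns this mini-batch is evaluated,
        # so gradients reach the shared layers and that head alone.
        return self.heads[lang](self.shared(feats))

def train_step(model, optimiser, feats, senone_targets, lang):
    # One update: cross-entropy is computed only against the senone set of `lang`.
    optimiser.zero_grad()
    loss = nn.functional.cross_entropy(model(feats, lang), senone_targets)
    loss.backward()            # the other languages' output layers get no gradient
    optimiser.step()
    return loss.item()

# Hypothetical senone counts and feature dimension, for illustration only.
senones = {"ta": 3000, "gu": 2800, "te": 3200}
model = MultiTaskDNN(feat_dim=440, hidden_dim=1024, senones_per_lang=senones)
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
feats = torch.randn(32, 440)                       # dummy spliced acoustic features
targets = torch.randint(0, senones["ta"], (32,))   # dummy Tamil senone labels
print(train_step(model, optimiser, feats, targets, lang="ta"))

Because only the selected language's output layer takes part in the forward pass, back-propagation leaves the other languages' output layers untouched, which corresponds to the modified cross-entropy loss and weight-update rule mentioned in the abstract.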
