Paper Title

Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification

Paper Authors

Afshan, Amber, Guo, Jinxi, Park, Soo Jin, Ravi, Vijay, McCree, Alan, Alwan, Abeer

Abstract

The effects of speaking-style variability on automatic speaker verification were investigated using the UCLA Speaker Variability database which comprises multiple speaking styles per speaker. An x-vector/PLDA (probabilistic linear discriminant analysis) system was trained with the SRE and Switchboard databases with standard augmentation techniques and evaluated with utterances from the UCLA database. The equal error rate (EER) was low when enrollment and test utterances were of the same style (e.g., 0.98% and 0.57% for read and conversational speech, respectively), but it increased substantially when styles were mismatched between enrollment and test utterances. For instance, when enrolled with conversation utterances, the EER increased to 3.03%, 2.96% and 22.12% when tested on read, narrative, and pet-directed speech, respectively. To reduce the effect of style mismatch, we propose an entropy-based variable frame rate technique to artificially generate style-normalized representations for PLDA adaptation. The proposed system significantly improved performance. In the aforementioned conditions, the EERs improved to 2.69% (conversation -- read), 2.27% (conversation -- narrative), and 18.75% (pet-directed -- read). Overall, the proposed technique performed comparably to multi-style PLDA adaptation without the need for training data in different speaking styles per speaker.
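The abstract names an entropy-based variable frame rate (VFR) technique but does not spell out the algorithm. As a rough illustration only (not the authors' implementation), entropy-based VFR can be sketched as: compute the spectral entropy of each short-time frame, then keep a frame only when the accumulated entropy change since the last kept frame crosses a threshold, so spectrally dynamic regions are sampled densely and steady regions sparsely. The function names, frame parameters, and threshold below are all illustrative assumptions.

```python
import numpy as np

def spectral_entropy(frames):
    # Shannon entropy of each frame's normalized magnitude spectrum.
    mag = np.abs(np.fft.rfft(frames, axis=1))
    p = mag / (mag.sum(axis=1, keepdims=True) + 1e-12)
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def vfr_select(signal, frame_len=400, hop=160, threshold=0.2):
    """Illustrative entropy-based variable frame rate selection.

    Keeps a frame only when the accumulated change in spectral entropy
    since the last kept frame exceeds `threshold`. All parameter values
    are placeholders, not the paper's settings.
    """
    n = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
    ent = spectral_entropy(frames)
    kept = [0]          # always keep the first frame
    acc = 0.0           # entropy change accumulated since last kept frame
    for i in range(1, n):
        acc += abs(ent[i] - ent[i - 1])
        if acc >= threshold:
            kept.append(i)
            acc = 0.0
    return frames[kept]
```

Applied to an utterance, such frame selection yields a time-warped version of the signal; the paper uses representations of this kind as artificially style-normalized data for PLDA adaptation.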
