Paper Title
When Automatic Voice Disguise Meets Automatic Speaker Verification
Paper Authors
Paper Abstract
The technique of transforming voices in order to hide the real identity of a speaker is called voice disguise. Automatic voice disguise (AVD), which modifies the spectral and temporal characteristics of voices with miscellaneous algorithms, is easily conducted with software accessible to the public. AVD poses a great threat to both human listening and automatic speaker verification (ASV). In this paper, we find that ASV is not only a victim of AVD but can also be a tool to defeat some simple types of AVD. First, three types of AVD, pitch scaling, vocal tract length normalization (VTLN), and voice conversion (VC), are introduced as representative methods. State-of-the-art ASV methods are then used to objectively evaluate the impact of AVD on ASV in terms of equal error rate (EER). Moreover, an approach that restores a disguised voice to its original version is proposed by minimizing a function of ASV scores with respect to the restoration parameters. Experiments are conducted on disguised voices from VoxCeleb, a dataset recorded in real-world noisy scenarios. The results show that, for voice disguise by pitch scaling, the proposed approach obtains an EER of around 7%, compared to the 30% EER of a recently proposed baseline that uses the ratio of fundamental frequencies. The proposed approach generalizes well to restoring the disguise with nonlinear frequency warping in VTLN, reducing its EER from 34.3% to 18.5%. However, it is difficult to restore the source speakers in VC with our approach; more complex forms of restoration functions or other paralinguistic cues might be necessary to invert the nonlinear transform in VC. Finally, contrastive visualization of ASV features with and without restoration illustrates the role of the proposed approach in an intuitive way.
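To make the restoration idea concrete, the following is a minimal sketch (not the paper's actual implementation) of optimizing a restoration parameter against ASV scores: a grid search over candidate pitch-scaling ratios, picking the ratio whose restored voice scores highest against the enrollment embedding. The functions `embed_fn` (speaker embedding extractor) and `pitch_shift_fn` (pitch-scaling operator) are hypothetical placeholders for whatever ASV front-end and vocoder one actually uses; the ASV score is assumed to be cosine similarity between embeddings.

```python
import math

def asv_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings,
    standing in for an ASV back-end score."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

def restore_pitch_scale(disguised, enroll_emb, embed_fn, pitch_shift_fn,
                        lo=0.5, hi=2.0, num=31):
    """Grid-search the restoration ratio that maximizes the ASV score of
    the restored voice against the enrollment embedding -- equivalently,
    minimizing a function of ASV scores w.r.t. the restoration parameter.

    `embed_fn` and `pitch_shift_fn` are assumed, user-supplied callables:
    embedding extraction and pitch scaling by a given ratio, respectively.
    """
    best_ratio, best_score = None, float("-inf")
    for i in range(num):
        r = lo + (hi - lo) * i / (num - 1)        # candidate inverse scaling ratio
        restored = pitch_shift_fn(disguised, r)    # attempt to undo the disguise
        score = asv_score(embed_fn(restored), enroll_emb)
        if score > best_score:
            best_ratio, best_score = r, score
    return best_ratio, best_score
```

On a toy model where the "voice" is just a fundamental-frequency value, disguising by a factor of 1.5 is undone by a ratio near 2/3, since the search lands on the grid point whose restored embedding is closest to enrollment. Real pitch scaling is nonlinear in its effect on embeddings, which is why the paper reports that this style of restoration works for pitch scaling and VTLN but not for VC.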