Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

Paper title

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

Authors

Jun Xue, Cunhang Fan, Zhao Lv, Jianhua Tao, Jiangyan Yi, Chengshi Zheng, Zhengqi Wen, Minmin Yuan, Shegang Shao

Abstract

Recently, pioneering research has proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant Q cepstral coefficients, etc.) for audio deepfake detection. These features obtain good performance and show that different subbands contribute differently to audio deepfake detection. However, this work does not explain what specific information each subband carries, and these features also discard information such as phase. Our approach is inspired by the mechanism of speech synthesis: fundamental frequency (F0) information is used to improve the quality of synthetic speech, yet the F0 of synthetic speech is still too averaged, which differs significantly from that of real speech. F0 is therefore expected to be an important cue for discriminating between bona fide and fake speech, but it cannot be used directly because its distribution is irregular. Instead, the frequency band containing most of the F0 is selected as the input feature. Meanwhile, to make full use of the phase and full-band information, we also propose to use real and imaginary spectrogram features as complementary input features and to model the disjoint subbands separately. Finally, the results of the F0, real and imaginary spectrogram features are fused. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is very effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all systems.
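The three input features named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the frame length, hop size, Hann window, and the 400 Hz cutoff for the band "containing most of the F0" are all assumptions chosen for illustration, and the function names (`stft`, `real_imag_features`, `log_power`, `low_band`) are hypothetical. The sketch only shows how the real and imaginary parts of the complex spectrogram keep the phase information that a log power spectrogram throws away, and how a low-frequency subband can be sliced out.

```python
import numpy as np


def stft(signal, n_fft=512, hop=160):
    """Hann-windowed short-time Fourier transform, shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


def real_imag_features(signal, n_fft=512, hop=160):
    """Real and imaginary spectrograms; together they retain the phase."""
    spec = stft(signal, n_fft, hop)
    return spec.real, spec.imag


def log_power(real, imag, eps=1e-10):
    """Log power spectrogram; phase is lost because only the magnitude is used."""
    return np.log(real**2 + imag**2 + eps)


def low_band(feature, sample_rate, n_fft, max_hz):
    """Keep only the frequency bins at or below max_hz."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    return feature[freqs <= max_hz]


# Toy input: one second of a 200 Hz tone standing in for voiced speech.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 200.0 * t)

real, imag = real_imag_features(tone)
# Assumed cutoff: 400 Hz is an illustrative value, not taken from the abstract.
f0_band = low_band(log_power(real, imag), sample_rate, n_fft=512, max_hz=400.0)
print(real.shape, imag.shape, f0_band.shape)
```

In a full system each of these features would feed its own classifier and the resulting scores would be fused, as the abstract describes; the classifiers and the fusion rule are outside the scope of this sketch.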

Paper title