Paper Title
DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection
Paper Authors
Paper Abstract
Significant research has been devoted to recognizing speakers in video streams by extracting high-level speaker features such as facial expression, emotion, and gender to build rich machine learning models. However, such a model cannot be obtained with single-modality feature extractors that exploit only the audio signals or only the image frames extracted from a video stream. In this paper, we address this problem from a different perspective and propose an unprecedented multimodal data fusion framework called DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection. We run DeepMSRF by feeding it features of two modalities, namely the speakers' audio and face images. DeepMSRF trains a two-stream VGGNet on both modalities to obtain a comprehensive model capable of accurately recognizing the speaker's identity. We apply DeepMSRF to a subset of the VoxCeleb2 dataset whose metadata is merged with the VGGFace2 dataset. Given a video stream, the goal of DeepMSRF is to first identify the speaker's gender and then recognize his or her name. The experimental results show that DeepMSRF outperforms single-modality speaker recognition methods by at least 3 percent in accuracy.
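To make the two-stream fusion idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: it assumes torchvision's stock VGG16 as the backbone of both streams, face crops and audio spectrograms rendered as 3-channel 224x224 images, a hypothetical num_speakers, and simple concatenation as the fusion step; DeepMSRF's feature-selection stage and its gender-first cascade are omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16


class TwoStreamVGG(nn.Module):
    """Sketch of a two-stream VGG: one backbone per modality, fused by concatenation."""

    def __init__(self, num_speakers: int):
        super().__init__()
        # One VGG16 convolutional backbone per modality (classifier heads dropped).
        self.face_stream = vgg16().features
        self.audio_stream = vgg16().features
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        # Fusion head over the concatenated per-stream feature maps.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 512 * 7 * 7, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_speakers),
        )

    def forward(self, face: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        f = self.pool(self.face_stream(face))    # (B, 512, 7, 7)
        a = self.pool(self.audio_stream(audio))  # (B, 512, 7, 7)
        fused = torch.cat([f, a], dim=1)         # (B, 1024, 7, 7)
        return self.classifier(fused)            # (B, num_speakers) logits


# Hypothetical usage: a batch of 4 face crops and 4 spectrogram images.
model = TwoStreamVGG(num_speakers=100)
faces = torch.randn(4, 3, 224, 224)
spectrograms = torch.randn(4, 3, 224, 224)
logits = model(faces, spectrograms)  # shape (4, 100)
```

Concatenation is only one plausible fusion choice; the same skeleton accommodates other strategies (e.g., averaging or a learned gating layer) by swapping the line that builds `fused`.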