Paper title
Speaker diarization with session-level speaker embedding refinement using graph neural networks
Authors
Abstract
Deep speaker embedding models are commonly used as a building block for speaker diarization systems. However, the speaker embedding model is usually trained according to a global loss defined on the training data, which can be sub-optimal for distinguishing speakers locally within a specific meeting session. In this work we present the first use of graph neural networks (GNNs) for the speaker diarization problem, using a GNN to refine speaker embeddings locally based on the structural information between speech segments inside each session. The speaker embeddings extracted by a pre-trained model are remapped into a new embedding space in which the different speakers within a single session are better separated. The model is trained for linkage prediction in a supervised manner by minimizing the difference between the affinity matrix constructed from the refined embeddings and the ground-truth adjacency matrix. Spectral clustering is then applied on top of the refined embeddings. We show that clustering on the refined speaker embeddings significantly outperforms clustering on the original embeddings on both simulated and real meeting data, and our system achieves state-of-the-art results on the NIST SRE 2000 CALLHOME database.
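The pipeline described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: cosine similarity as the affinity, mean squared error as the "difference" between the affinity and adjacency matrices, a single untrained neighbor-averaging step (`propagate`) standing in for the trained GNN, and a two-speaker Laplacian bipartition standing in for full spectral clustering. All function names and the toy data are hypothetical.

```python
import numpy as np

def cosine_affinity(emb):
    """Pairwise cosine similarity between segment embeddings (assumed affinity)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def adjacency_from_labels(labels):
    """Ground-truth adjacency: 1 where two segments share a speaker, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def linkage_loss(emb, labels):
    """Mean squared difference between affinity and adjacency (assumed loss form)."""
    diff = cosine_affinity(emb) - adjacency_from_labels(labels)
    return float(np.mean(diff ** 2))

def propagate(emb, weight=None, threshold=0.6):
    """One graph-convolution-style step: average each segment with its
    strongly connected neighbors, then apply a linear map.
    With weight=None the map is the identity, i.e. nothing is trained."""
    aff = cosine_affinity(emb)
    graph = np.where(aff >= threshold, aff, 0.0)      # drop weak edges
    graph = graph / graph.sum(axis=1, keepdims=True)  # row-normalize
    if weight is None:
        weight = np.eye(emb.shape[1])
    return graph @ emb @ weight

def spectral_bipartition(emb):
    """Two-speaker simplification of spectral clustering: split segments by the
    sign of the second-smallest eigenvector of the graph Laplacian."""
    aff = np.clip(cosine_affinity(emb), 0.0, None)
    laplacian = np.diag(aff.sum(axis=1)) - aff
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)

# Toy session: four segments, two speakers (hypothetical data).
segments = np.array([[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.28, 0.96]])
speakers = [0, 0, 1, 1]

refined = propagate(segments)
loss_before = linkage_loss(segments, speakers)
loss_after = linkage_loss(refined, speakers)
clusters = spectral_bipartition(refined)
print(loss_before, loss_after, clusters)
```

In a trained system the `weight` matrix (and any deeper layers) would be learned by minimizing a linkage loss of this kind over many sessions. Here it is left as the identity so the example stays self-contained.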