Multi-target Extractor and Detector for Unknown-number Speaker Diarization

Paper Title

Multi-target Extractor and Detector for Unknown-number Speaker Diarization

Authors

Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

Abstract

Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis, regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging case with simultaneous speakers ranging from 2 to 7 show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines.
