Paper title
DNN Speaker Tracking with Embeddings
Paper authors
Paper abstract
In multi-speaker applications, it is common to have pre-computed models of enrolled speakers. Speaker tracking is the task of using these models to identify the instances in which those speakers intervene in a recording. In this paper, we propose a novel embedding-based speaker tracking method. Specifically, our design is based on a convolutional neural network that mimics a typical speaker verification PLDA (probabilistic linear discriminant analysis) classifier and finds the regions uttered by the target speakers in an online fashion. The system was studied from two perspectives, diarization and tracking; in both, the results show a significant improvement over the PLDA baseline under the same experimental conditions. Two standard public datasets, CALLHOME and DIHARD II single channel, were modified to create two-speaker subsets with overlapping and non-overlapping regions. We evaluate the robustness of our supervised approach using speaker models generated from different segment lengths. A relative improvement of 17% in DER (diarization error rate) on DIHARD II single channel shows promising performance. Furthermore, to make the baseline system similar to speaker tracking, non-target speakers were added to the recordings. Even under these adverse conditions, our approach is robust enough to outperform the PLDA baseline.
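The tracking task described in the abstract can be illustrated with a minimal sketch. It is not the paper's system: plain cosine similarity stands in for the PLDA or CNN scorer, and the function names, the two-dimensional embeddings, and the threshold of 0.5 are all assumptions made for illustration. Segments are scored one at a time in arrival order against each enrolled model, so a segment may be assigned to one target speaker, to several (overlap), or to none (non-target speech).

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; a stand-in scorer, NOT the paper's PLDA/CNN."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def track_speakers(
    segment_embeddings: Sequence[Sequence[float]],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> List[List[str]]:
    """Score each incoming segment against every enrolled speaker model.

    Returns, per segment, the sorted names of the enrolled speakers whose
    score reaches the threshold. An empty list means no target speaker.
    """
    decisions = []
    for emb in segment_embeddings:  # processed in arrival order (online)
        active = [
            name
            for name, model in enrolled.items()
            if cosine(emb, model) >= threshold
        ]
        decisions.append(sorted(active))
    return decisions


def relative_improvement(baseline_der: float, system_der: float) -> float:
    """Relative DER reduction with respect to the baseline."""
    return (baseline_der - system_der) / baseline_der


# Toy two-dimensional "embeddings" (hypothetical values).
enrolled_models = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [-1.0, -1.0]]
decisions = track_speakers(segments, enrolled_models)
```

In this toy input the third segment is close to both models and is attributed to both speakers, and the fourth matches neither. The helper `relative_improvement` shows how a relative DER figure is obtained: with hypothetical DER values of 20.0 for a baseline and 16.6 for a system (not figures from the paper), the relative improvement is 17%.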