Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

Paper title

Utterance-by-utterance overlap-aware neural diarization with Graph-PIT

Authors

Keisuke Kinoshita, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, Reinhold Haeb-Umbach

Abstract

Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level results via clustering to form a final global diarization result. The segmentation is done to limit the number of speakers in each segment since the current EEND cannot handle a large number of speakers. In this paper, we argue that such an approach involving the segmentation has several issues; for example, it inevitably faces a dilemma that larger segment sizes increase both the context available for enhancing the performance and the number of speakers for the local EEND module to handle. To resolve such a problem, this paper proposes a novel framework that performs diarization without segmentation. However, it can still handle challenging data containing many speakers and a significant amount of overlapping speech. The proposed method can take an entire meeting for inference and perform utterance-by-utterance diarization that clusters utterance activities in terms of speakers. To this end, we leverage a neural network training scheme called Graph-PIT proposed recently for neural source separation. Experiments with simulated active-meeting-like data and CALLHOME data show the superiority of the proposed approach over the conventional methods.
