Paper Title
Attention and DCT based Global Context Modeling for Text-independent Speaker Recognition
Paper Authors
Paper Abstract
Learning an effective speaker representation is crucial for achieving reliable performance in speaker verification tasks. Speech signals are high-dimensional, long, and variable-length sequences that carry diverse information at each time-frequency (TF) location. Standard convolutional layers, which operate on neighboring local regions, often fail to capture complex global TF information. Our motivation is to alleviate these challenges by increasing the modeling capacity, emphasizing significant information, and suppressing possible redundancies. We aim to design a more robust and efficient speaker recognition system that effectively represents the global information in speech signals by incorporating the benefits of attention mechanisms and Discrete Cosine Transform (DCT) based signal processing techniques. To achieve this, we propose a general global time-frequency context modeling block for speaker modeling. First, an attention-based context model is introduced to capture long-range and non-local relationships across different time-frequency locations. Second, a 2D-DCT based context model is proposed to improve model efficiency and examine the benefits of signal modeling. A multi-DCT attention mechanism is presented to improve modeling power with alternative DCT basis forms. Finally, the global context information is used to recalibrate salient time-frequency locations by computing the similarity between the global context and local features. This improves speaker verification performance by a large margin compared with the standard ResNet model and the Squeeze & Excitation block. Our experimental results show that the proposed global context modeling method can efficiently improve the learned speaker representations by achieving channel-wise and time-frequency feature recalibration.
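To make the described pipeline concrete, below is a minimal PyTorch sketch of a global TF context block in the spirit of the abstract: attention weights pooled over all time-frequency positions form a global context vector, which is then used to recalibrate the local features. The names `GlobalTFContext` and `dct_basis_2d` and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch, assuming input feature maps of shape (batch, channels, freq, time).
import math
import torch
import torch.nn as nn

class GlobalTFContext(nn.Module):
    """Attention-pooled global TF context with feature recalibration (a sketch)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # 1x1 conv scores every time-frequency location for attention pooling.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)
        # Bottleneck transform of the pooled global context vector.
        self.transform = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape
        # Softmax over all TF positions -> long-range, non-local attention weights.
        w = self.score(x).view(b, 1, f * t).softmax(dim=-1)      # (b, 1, F*T)
        # Weighted sum of local features -> one global context vector per utterance.
        ctx = torch.bmm(x.view(b, c, f * t), w.transpose(1, 2))  # (b, c, 1)
        ctx = self.transform(ctx.squeeze(-1))                    # (b, c)
        # Recalibrate local features by their similarity to the global context.
        gate = torch.sigmoid(ctx).view(b, c, 1, 1)
        return x * gate

def dct_basis_2d(f: int, t: int, u: int = 0, v: int = 0) -> torch.Tensor:
    """One 2D DCT-II basis of shape (f, t); (u, v) select the frequency pair."""
    nf = torch.arange(f, dtype=torch.float32)
    nt = torch.arange(t, dtype=torch.float32)
    bf = torch.cos(math.pi * (nf + 0.5) * u / f)
    bt = torch.cos(math.pi * (nt + 0.5) * v / t)
    return torch.outer(bf, bt)

if __name__ == "__main__":
    x = torch.randn(2, 64, 40, 100)    # (batch, channels, freq, time)
    block = GlobalTFContext(channels=64)
    print(block(x).shape)              # torch.Size([2, 64, 40, 100])
```

Replacing the learned softmax scores `w` with fixed, normalized 2D-DCT bases from `dct_basis_2d`, and pooling with several `(u, v)` pairs before summing the resulting context vectors, would correspond to the DCT-based context model and multi-DCT attention variants described in the abstract.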