Paper Title

Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances

Authors

Youngmoon Jung, Seong Min Kye, Yeunju Choi, Myunghun Jung, Hoirin Kim

Abstract

Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, a speaker embedding vector is obtained by pooling single-scale features extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced and shows superior performance for variable-duration utterances. To increase robustness when dealing with utterances of arbitrary duration, this paper improves MSA by using a feature pyramid module. The module enhances the speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings using the enhanced features, which contain rich speaker information at different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves previous MSA methods with a smaller number of parameters. It also achieves better performance than state-of-the-art approaches for both short and long utterances.
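
Since the abstract describes the feature pyramid module only at a high level (a top-down pathway plus lateral connections, in the spirit of FPN), below is a minimal PyTorch sketch of such a module. The class name `FeaturePyramidModule`, the channel widths (128/256/512 projected to a common 256), and the nearest-neighbor upsampling are illustrative assumptions, not the authors' implementation; the enhanced per-level maps would then feed the per-layer pooling of an MSA pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePyramidModule(nn.Module):
    """FPN-style enhancement of multi-layer features via a top-down
    pathway and lateral connections (hypothetical sketch)."""

    def __init__(self, in_channels=(128, 256, 512), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions project each backbone stage to a common width.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )
        # 3x3 smoothing convolutions reduce aliasing after top-down fusion.
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels]
        )

    def forward(self, feats):
        # feats: feature maps from shallow to deep layers, e.g. shapes
        # (B, 128, F1, T1), (B, 256, F2, T2), (B, 512, F3, T3).
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the deeper map and add it to the
        # lateral projection of the shallower map.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest"
            )
        # Smooth each enhanced map; these feed per-level pooling/aggregation.
        return [s(x) for s, x in zip(self.smooth, laterals)]


if __name__ == "__main__":
    fpm = FeaturePyramidModule()
    # Dummy multi-scale (frequency x time) feature maps for a batch of 2.
    feats = [
        torch.randn(2, 128, 40, 200),
        torch.randn(2, 256, 20, 100),
        torch.randn(2, 512, 10, 50),
    ]
    enhanced = fpm(feats)
    print([e.shape for e in enhanced])  # all with 256 channels
```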
