Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing

Paper title

Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing

Authors

Kakouros, Sofoklis; Stafylakis, Themos; Mosner, Ladislav; Burget, Lukas

Abstract

When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal and how to best quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture information from speech enabling state-of-the-art results in many downstream tasks including emotion recognition. However, better ways of aggregating the information across time need to be considered as the relevant emotion information is likely to appear piecewise and not uniformly across the signal. For the labels, we need to take into account that there is a substantial degree of noise that comes from the subjective human annotations. In this paper, we propose a novel approach to attentive pooling based on correlations between the representations' coefficients combined with label smoothing, a method aiming to reduce the confidence of the classifier on the training labels. We evaluate our proposed approach on the benchmark dataset IEMOCAP, and demonstrate high performance surpassing that in the literature. The code to reproduce the results is available at github.com/skakouros/s3prl_attentive_correlation.
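
The abstract describes label smoothing as a way to reduce the classifier's confidence in the training labels. The following is a minimal sketch of the standard label-smoothing formulation, not the authors' implementation (their code is in the repository named above). The function names, the smoothing factor `epsilon=0.1`, and the four-class setup are assumptions chosen only for illustration.

```python
import math


def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution.

    The true class receives (1 - epsilon) + epsilon / K and every other
    class receives epsilon / K, so the result still sums to 1.
    epsilon=0.1 is an assumed value, not one taken from the paper.
    """
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if k == target_index else uniform
        for k in range(num_classes)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_dist, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_dist, probs) if t > 0.0)


# Hypothetical 4-class example where the annotated class is index 2.
hard_target = smooth_labels(2, 4, epsilon=0.0)   # plain one-hot target
soft_target = smooth_labels(2, 4, epsilon=0.1)   # smoothed target
confident_logits = [0.0, 0.0, 10.0, 0.0]

hard_loss = cross_entropy(hard_target, confident_logits)
soft_loss = cross_entropy(soft_target, confident_logits)
```

With smoothed targets, a very confident prediction is still penalized for assigning almost no probability to the other classes, so `soft_loss` is larger than `hard_loss` here. That penalty is what discourages overconfidence when the annotations themselves are noisy.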
