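Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech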

Paper Title

Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech

Authors

Kusha Sridhar, Carlos Busso

Abstract

The prediction of valence from speech is an important, but challenging problem. The externalization of valence in speech has speaker-dependent cues, which contribute to performances that are often significantly lower than the prediction of other emotional attributes such as arousal and dominance. A practical approach to improve valence prediction from speech is to adapt the models to the target speakers in the test set. Adapting a speech emotion recognition (SER) system to a particular speaker is a hard problem, especially with deep neural networks (DNNs), since it requires optimizing millions of parameters. This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with similar acoustic patterns as the speaker in the test set. Speech samples from the selected speakers are used to create the adaptation set. This approach leverages transfer learning using pre-trained models, which are adapted with these speech samples. We propose three alternative adaptation strategies: unique speaker, oversampling and weighting approaches. These methods differ on the use of the adaptation set in the personalization of the valence models. The results demonstrate that a valence prediction model can be efficiently personalized with these unsupervised approaches, leading to relative improvements as high as 13.52%.
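
The speaker-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how acoustic similarity is measured, so everything below is an assumption made for illustration: speakers are represented by the mean of their utterance-level feature vectors, similarity is cosine similarity, and the names `select_similar_speakers`, `build_adaptation_set`, and the parameter `k` are hypothetical. It is not the authors' implementation, and it does not cover the three adaptation strategies.

```python
import numpy as np


def select_similar_speakers(train_feats, test_feats, k=2):
    """Return the k train speakers closest to the test speaker.

    train_feats: dict mapping speaker id -> array (n_utterances, n_features)
    test_feats:  array (n_utterances, n_features) for one test speaker
    Assumption: similarity is the cosine between mean feature vectors;
    the source abstract does not state the actual similarity measure.
    """
    target = test_feats.mean(axis=0)
    scores = {}
    for speaker, feats in train_feats.items():
        centroid = feats.mean(axis=0)
        denom = np.linalg.norm(centroid) * np.linalg.norm(target)
        # Guard against zero vectors to avoid division by zero.
        scores[speaker] = float(centroid @ target / denom) if denom > 0 else 0.0
    # Highest similarity first.
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_adaptation_set(train_feats, selected):
    """Stack the utterances of the selected train speakers."""
    return np.vstack([train_feats[s] for s in selected])


# Toy data: speakers "A" and "C" resemble the test speaker, "B" does not.
train = {
    "A": np.array([[1.0, 0.0], [1.0, 0.0]]),
    "B": np.array([[0.0, 1.0], [0.0, 1.0]]),
    "C": np.array([[1.0, 0.2], [1.0, 0.2]]),
}
test = np.array([[1.0, 0.0], [0.9, 0.0]])

selected = select_similar_speakers(train, test, k=2)
adaptation_set = build_adaptation_set(train, selected)
```

No labels from the test speaker are used in this selection, which is what makes the approach unsupervised; a pre-trained model would then be fine-tuned on the resulting adaptation set.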
