Paper Title

Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Paper Authors

Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, Suranga Nanayakkara

Paper Abstract

Multimodal emotion recognition from speech is an important area in affective computing. Fusing multiple data modalities and learning representations with limited amounts of labeled data is a challenging task. In this paper, we explore the use of modality-specific "BERT-like" pretrained Self Supervised Learning (SSL) architectures to represent both speech and text modalities for the task of multimodal speech emotion recognition. By conducting experiments on three publicly available datasets (IEMOCAP, CMU-MOSEI, and CMU-MOSI), we show that jointly fine-tuning "BERT-like" SSL architectures achieves state-of-the-art (SOTA) results. We also evaluate two methods of fusing speech and text modalities and show that a simple fusion mechanism can outperform more complex ones when using SSL models that have similar architectural properties to BERT.
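
The sketch below illustrates the general idea described in the abstract: two modality-specific "BERT-like" SSL encoders (one for text, one for speech) are fine-tuned jointly, and their pooled utterance embeddings are combined with a simple concatenation-based fusion head. This is a minimal illustration, not the authors' exact implementation; the specific pretrained checkpoints (`bert-base-uncased`, `facebook/wav2vec2-base`), the pooling choices, and the classifier head are all assumptions made for the example.

```python
# Minimal sketch of joint fine-tuning of two "BERT-like" SSL encoders with a
# simple concatenation-based fusion head. Model checkpoints, pooling, and the
# classifier head are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
from transformers import BertModel, Wav2Vec2Model

class SimpleFusionEmotionClassifier(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        # Modality-specific pretrained SSL encoders, both trained end to end.
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        hidden = (self.text_encoder.config.hidden_size
                  + self.speech_encoder.config.hidden_size)
        # "Simple" fusion: concatenate pooled utterance embeddings, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask, speech_values):
        text_out = self.text_encoder(input_ids=input_ids,
                                     attention_mask=attention_mask)
        text_vec = text_out.last_hidden_state[:, 0]             # [CLS] embedding
        speech_out = self.speech_encoder(speech_values)
        speech_vec = speech_out.last_hidden_state.mean(dim=1)   # mean pooling
        fused = torch.cat([text_vec, speech_vec], dim=-1)
        return self.classifier(fused)

# Joint fine-tuning: a single optimizer updates both encoders and the fusion head.
model = SimpleFusionEmotionClassifier(num_classes=4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```

Because all parameters (both encoders plus the fusion head) sit in one optimizer, the pretrained SSL representations are adapted to the emotion-recognition objective rather than being used as frozen features, which is the "joint fine-tuning" the abstract refers to.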
