Analysis of constant-Q filterbank based representations for speech emotion recognition

Paper Title

Analysis of constant-Q filterbank based representations for speech emotion recognition

Authors

Singh, Premjeet; Waldekar, Shefali; Sahidullah, Md; Saha, Goutam

Abstract

This work analyzes constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). A constant-Q filterbank provides a non-linear spectro-temporal representation with higher frequency resolution at low frequencies. Our investigation reveals how the increased low-frequency resolution benefits SER. A time-domain comparative analysis between short-term mel-frequency spectral coefficients (MFSCs) and constant-Q filterbank-based features, namely the constant-Q transform (CQT) and the continuous wavelet transform (CWT), reveals that constant-Q representations provide higher time-invariance at low frequencies. This provides increased robustness against emotion-irrelevant temporal variations in pitch, especially for low-arousal emotions. The corresponding frequency-domain analysis over different emotion classes shows better resolution of pitch harmonics in constant-Q-based time-frequency representations than in MFSCs. These advantages of constant-Q representations are further supported by SER performance in an extensive evaluation of the features over four publicly available databases with six advanced deep neural network architectures as back-end classifiers. Our inferences in this study point toward the suitability and potential of constant-Q features for SER.
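
The abstract's claim of higher frequency resolution at low frequencies follows from the constant-Q property itself: every filter has the same ratio of center frequency to bandwidth. A minimal sketch of that relationship is given below. It is not the authors' implementation; the function name `constant_q_bands` and the parameter values (`f_min`, `bins_per_octave`, `n_bins`) are assumptions chosen only for illustration.

```python
import math


def constant_q_bands(f_min=32.7, bins_per_octave=12, n_bins=48):
    """Return (Q, [(center_hz, bandwidth_hz), ...]) for a constant-Q filterbank.

    All default values are illustrative assumptions, not settings from the paper.
    """
    # Q = f_k / bandwidth_k is identical for every filter in the bank.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bands = []
    for k in range(n_bins):
        # Center frequencies are geometrically spaced: f_k = f_min * 2^(k / B).
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        # Bandwidth grows in proportion to f_k, so low bins are narrow (fine resolution).
        bands.append((f_k, f_k / q))
    return q, bands


q, bands = constant_q_bands()
for f_k, bw in (bands[0], bands[-1]):
    print(f"center {f_k:8.1f} Hz  bandwidth {bw:6.2f} Hz  Q {f_k / bw:.2f}")
```

The lowest bin has the narrowest bandwidth and the highest bin the widest, while Q stays fixed. This is the sense in which a constant-Q representation resolves closely spaced low-frequency components, such as pitch harmonics, more finely than a fixed-bandwidth analysis.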
