Paper title
Analysis of constant-Q filterbank based representations for speech emotion recognition
Paper authors
Paper abstract
This work analyzes constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). A constant-Q filterbank provides a non-linear spectro-temporal representation with higher frequency resolution at low frequencies. Our investigation reveals how the increased low-frequency resolution benefits SER. A time-domain comparative analysis between short-term mel-frequency spectral coefficients (MFSCs) and constant-Q filterbank-based features, namely the constant-Q transform (CQT) and the continuous wavelet transform (CWT), reveals that constant-Q representations provide higher time-invariance at low frequencies. This provides increased robustness against emotion-irrelevant temporal variations in pitch, especially for low-arousal emotions. The corresponding frequency-domain analysis over different emotion classes shows better resolution of pitch harmonics in constant-Q-based time-frequency representations than in MFSCs. These advantages of constant-Q representations are further supported by SER performance in an extensive evaluation of the features over four publicly available databases, with six advanced deep neural network architectures as the back-end classifiers. Our inferences in this study point toward the suitability and potential of constant-Q features for SER.
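To make the claim of "higher frequency resolution at low frequencies" concrete, the following is a minimal sketch of how the bands of a generic constant-Q filterbank are laid out. It is not the authors' implementation: the function name `constant_q_bands` and the values `f_min=32.7`, `bins_per_octave=12`, and `n_bins=84` are illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def constant_q_bands(f_min=32.7, bins_per_octave=12, n_bins=84):
    """Centre frequencies and bandwidths (Hz) of a generic constant-Q filterbank.

    All default values are illustrative assumptions, not settings from the paper.
    """
    k = np.arange(n_bins)
    # Centre frequencies are geometrically spaced: each octave holds the same number of bins.
    centres = f_min * 2.0 ** (k / bins_per_octave)
    # The quality factor Q = centre / bandwidth is the same for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bandwidths = centres / q
    return centres, bandwidths, q

centres, bandwidths, q = constant_q_bands()
print(f"Q = {q:.2f}")
print(f"lowest bin:  centre {centres[0]:8.1f} Hz, bandwidth {bandwidths[0]:7.2f} Hz")
print(f"highest bin: centre {centres[-1]:8.1f} Hz, bandwidth {bandwidths[-1]:7.2f} Hz")
```

Because the bandwidth grows in proportion to the centre frequency, the low-frequency bins are narrow (fine frequency resolution, long analysis windows) and the high-frequency bins are wide. This is the property the abstract relates to resolving pitch harmonics and to time-invariance at low frequencies.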