Title
Acoustic Scene Classification using Audio Tagging
Authors
Abstract
Acoustic scene classification systems using deep neural networks classify a given recording into one of several pre-defined classes. In this study, we propose a novel acoustic scene classification scheme that adopts an audio tagging system, inspired by the human perception mechanism. When humans identify an acoustic scene, the presence of different sound events provides discriminative information that affects the judgement. The proposed framework mimics this mechanism using various approaches. First, we employ three methods to concatenate tag vectors extracted using an audio tagging system with an intermediate hidden layer of an acoustic scene classification system. We also explore multi-head attention over the feature map of the acoustic scene classification system using tag vectors. Experiments conducted on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 1-A dataset demonstrate the effectiveness of the proposed scheme. Concatenation and multi-head attention achieve classification accuracies of 75.66% and 75.58%, respectively, compared to the baseline's 73.63%. A system combining the two proposed approaches achieves an accuracy of 76.75%.
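The two mechanisms described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual architecture: all dimensions, weight matrices, and the single-query attention layout are assumptions for the sake of the example. It shows (1) concatenating a tag vector with a pooled intermediate feature, and (2) using the tag vector as the query in a multi-head attention over the time axis of a feature map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): a feature map with
# T time steps and D channels from the scene classifier, a C-dim tag
# vector from the audio tagging system, and H attention heads.
T, D, C, H = 16, 64, 32, 4

feat = rng.standard_normal((T, D))  # intermediate feature map
tag = rng.standard_normal(C)        # audio-tagging output vector

# --- Approach 1: concatenation with an intermediate hidden layer ---
pooled = feat.mean(axis=0)                # global average pooling over time
hidden = np.concatenate([pooled, tag])    # shape (D + C,), fed to later layers

# --- Approach 2: multi-head attention with the tag vector as query ---
Wq = rng.standard_normal((C, D)) / np.sqrt(C)  # illustrative query projection
q = tag @ Wq                                   # query in feature space, (D,)

d_h = D // H                                   # per-head dimension
heads = []
for h in range(H):
    sl = slice(h * d_h, (h + 1) * d_h)
    scores = feat[:, sl] @ q[sl] / np.sqrt(d_h)  # scaled dot product, (T,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                 # softmax over time steps
    heads.append(w @ feat[:, sl])                # attention-weighted sum, (d_h,)
attended = np.concatenate(heads)                 # shape (D,), tag-conditioned

print(hidden.shape, attended.shape)  # (96,) (64,)
```

In this sketch the tag vector steers which time steps of the scene-classifier feature map receive attention, which is one plausible reading of "multi-head attention on the feature map using tag vectors"; the paper's exact formulation may differ.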