Paper Title
Instance-level loss based multiple-instance learning framework for acoustic scene classification
Paper Authors
Paper Abstract
In the acoustic scene classification (ASC) task, an acoustic scene consists of diverse sounds and is inferred by identifying combinations of distinct attributes among them. This study aims to extract and cluster these attributes effectively using an improved multiple-instance learning (MIL) framework for ASC. MIL, a weakly supervised learning method, is a strategy that extracts instances from the bundle of frames composing an input audio clip and infers the scene corresponding to the input data from these unlabeled instances. However, many studies have pointed out an underestimation problem in MIL. In this study, we develop an MIL framework more suitable for ASC systems by defining instance-level labels and an instance-level loss to extract and cluster instances effectively. Furthermore, we design a fully separated convolutional module, a lightweight neural network comprising pointwise, frequency-sided depthwise, and temporal-sided depthwise convolutional filters. As a result, compared with vanilla MIL, the confidence and proportion of positive instances increase significantly, overcoming the underestimation problem and improving classification accuracy by up to 11%. The proposed system achieves accuracies of 81.1% and 72.3% on the TAU Urban Acoustic Scenes 2019 and TAU Urban Acoustic Scenes 2020 Mobile datasets, respectively, with 139 K parameters. In particular, it achieves the highest performance among systems with fewer than 1 M parameters on the TAU Urban Acoustic Scenes 2019 dataset.
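
The sketch below illustrates, in PyTorch, the two ideas named in the abstract: a "fully separated" convolutional module that factorizes a 2-D convolution into pointwise, frequency-sided depthwise, and temporal-sided depthwise filters, and an MIL head that scores each time frame (instance) before aggregating to a clip-level (bag-level) prediction. This is a minimal illustration under stated assumptions, not the authors' implementation: all class and parameter names (`FullySeparatedConv`, `MILHead`, kernel sizes, mean-pooling aggregation) are hypothetical, and the paper's specific instance-level label assignment and loss are not reproduced here.

```python
# Minimal sketch (hypothetical names); assumes input spectrograms shaped
# (batch, channels, freq, time). Not the paper's exact architecture.
import torch
import torch.nn as nn


class FullySeparatedConv(nn.Module):
    """Pointwise -> frequency-sided depthwise -> temporal-sided depthwise conv."""

    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        pad = kernel // 2
        # 1x1 pointwise convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        # Depthwise filter along the frequency axis only (k x 1).
        self.freq_dw = nn.Conv2d(out_ch, out_ch, kernel_size=(kernel, 1),
                                 padding=(pad, 0), groups=out_ch, bias=False)
        # Depthwise filter along the time axis only (1 x k).
        self.time_dw = nn.Conv2d(out_ch, out_ch, kernel_size=(1, kernel),
                                 padding=(0, pad), groups=out_ch, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.time_dw(self.freq_dw(self.pointwise(x)))))


class MILHead(nn.Module):
    """Scores each time frame (instance), then pools to a clip (bag) logit."""

    def __init__(self, channels: int, n_classes: int):
        super().__init__()
        self.classifier = nn.Conv2d(channels, n_classes, kernel_size=1)

    def forward(self, x: torch.Tensor):
        x = x.mean(dim=2, keepdim=True)       # collapse frequency: (B, C, 1, T)
        inst = self.classifier(x).squeeze(2)  # per-frame instance logits: (B, K, T)
        bag = inst.mean(dim=-1)               # bag-level logits (mean-pooling assumed)
        # Returning inst alongside bag allows adding an instance-level loss term.
        return bag, inst


if __name__ == "__main__":
    feats = torch.randn(4, 1, 64, 100)        # 4 log-mel spectrograms (illustrative)
    conv = FullySeparatedConv(1, 32)
    head = MILHead(32, n_classes=10)
    bag_logits, inst_logits = head(conv(feats))
    print(bag_logits.shape, inst_logits.shape)  # (4, 10) and (4, 10, 100)
```

The factorization is what keeps the parameter count low: a full `k x k` convolution over `C` channels costs on the order of `k*k*C*C` weights, whereas the pointwise plus two one-dimensional depthwise filters cost roughly `C*C + 2*k*C`, which is how a system in this style can stay well under 1 M parameters.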