
Abstract of Training sound event classifiers using different types of supervision

Eduardo David Fonseca Montero

  • The automatic recognition of sound events has gained attention in the past few years, motivated by emerging applications in fields such as healthcare, smart homes, and urban planning. When the work for this thesis started, research on sound event classification was mainly focused on supervised learning using small datasets, often carefully annotated with vocabularies limited to specific domains (e.g., urban or domestic). However, such small datasets do not support training classifiers able to recognize the hundreds of sound events that occur in our everyday environment, such as kettle whistles, bird tweets, cars passing by, or different types of alarms. At the same time, large amounts of environmental sound data are hosted on websites such as Freesound or YouTube, which can be convenient for training large-vocabulary classifiers, particularly using data-hungry deep learning approaches. To advance the state of the art in sound event classification, this thesis investigates several strands of dataset creation as well as supervised and unsupervised learning to train large-vocabulary sound event classifiers, using different types of supervision in novel and alternative ways. Specifically, we focus on supervised learning using clean and noisy labels, as well as self-supervised representation learning from unlabeled data.

    The first part of this thesis focuses on the creation of FSD50K, a large-vocabulary dataset with over 100 hours of audio manually labeled using 200 classes of sound events. We provide a detailed description of the creation process and a comprehensive characterization of the dataset. In addition, we explore architectural modifications to increase shift invariance in CNNs, improving robustness to time/frequency shifts in input spectrograms. In the second part, we focus on training sound event classifiers using noisy labels. First, we propose a dataset that supports the investigation of real label noise. Then, we explore network-agnostic approaches to mitigate the effect of label noise during training, including regularization techniques, noise-robust loss functions, and strategies to reject examples with noisy labels. Further, we develop a teacher-student framework to address the problem of missing labels in sound event datasets. In the third part, we propose algorithms to learn audio representations from unlabeled data. In particular, we develop self-supervised contrastive learning frameworks, where representations are learned by comparing pairs of examples constructed via data augmentation and automatic sound separation methods. Finally, we report on the organization of two DCASE Challenge tasks on automatic audio tagging with noisy labels. By providing data resources as well as state-of-the-art approaches and audio representations, this thesis contributes to the advancement of open sound event research, and to the transition from traditional supervised learning using clean labels to other learning strategies that depend less on costly annotation efforts.

