Khaled Koutini, "Inductive Bias in Learning General Audio Representations", 2022
Original Title:
Inductive Bias in Learning General Audio Representations
Language of the Title:
English
Original Abstract:
Machine auditory perception is a critical component in the development of artificial intelligence systems capable of comprehending their surroundings. Perceiving and understanding audio signals has numerous applications in current and future technologies, such as content-based multimedia information retrieval, context-aware smart devices, and monitoring systems.
The purpose of this thesis is to advance machine learning methods and their applications in audio signal processing. In particular, recent successful machine learning approaches are improved and adapted for audio processing. I evaluate and compare these methods on a wide range of acoustic tasks such as auditory scene and environmental analysis and understanding, music information retrieval, audio tagging, and acoustic event detection. In this thesis, I will show how to incorporate specific inductive biases appropriate for the audio domain into general-purpose machine learning approaches, resulting in robust, efficient, and performant auditory perception systems.
The first part of this thesis is devoted to convolutional neural networks (CNNs), the dominant method for processing and perceiving audio signals. I investigate how to improve the generalization of these networks on a variety of audio and music processing tasks, including acoustic scene classification, polyphonic musical instrument recognition, and emotion and theme detection in music. I will demonstrate how to optimize and regularize the receptive field of these models to substantially improve their generalization. Furthermore, I will analyze which characteristics of these networks have the greatest effect on their performance and, based on this analysis, show how to scale these networks to improve their generalization in both the low-complexity and the over-parameterized regime.
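The receptive field of a sequential CNN, i.e., the spectrogram region that a single output unit can see, is determined analytically by the kernel sizes and strides of the stacked layers. As a minimal illustration (not the thesis code; the layer configuration is a hypothetical example), the following Python sketch computes this quantity:

    def receptive_field(layers):
        """Receptive field of a sequential CNN.

        `layers` is a list of (kernel_size, stride) tuples, one per
        convolutional or pooling layer, in order from input to output.
        """
        rf = 1    # receptive field of a single input bin
        jump = 1  # distance in input units between adjacent output units
        for kernel_size, stride in layers:
            rf += (kernel_size - 1) * jump
            jump *= stride
        return rf

    # Example: three 3x3 layers, the first two with stride 2.
    print(receptive_field([(3, 2), (3, 2), (3, 1)]))  # -> 15

Constraining this quantity, e.g., by adjusting strides or replacing some spatial convolutions with 1x1 convolutions, is one way such a receptive-field restriction can be realized in practice.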
The second part of this thesis is devoted to self-attention and transformer models, which, following their dominance in natural language processing, have recently gained popularity through their success on a variety of vision tasks. I introduce Patchout, a novel training method and inductive bias suited to training transformer models on audio spectrograms, and show how it improves transformer performance on the largest publicly available audio classification dataset. Additionally, I investigate how to extract useful representations from pre-trained transformer models in order to improve the performance of machine learning methods on downstream tasks with limited data.
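The core idea of Patchout is to remove a random subset of the spectrogram patch tokens from the transformer's input sequence during training, which regularizes the model and shortens the sequence that the quadratic-cost self-attention layers must process. Below is a minimal sketch of this idea, assuming a ViT-style audio transformer and PyTorch; names, shapes, and the keep ratio are illustrative assumptions, not the thesis implementation:

    import torch

    def patchout(patch_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
        """Randomly drop patch tokens during training.

        patch_tokens: (batch, num_patches, embed_dim) spectrogram patch
            embeddings (positional encodings already added).
        keep_ratio: fraction of patches kept; the rest are removed from
            the sequence entirely.
        """
        batch, num_patches, embed_dim = patch_tokens.shape
        num_keep = max(1, int(num_patches * keep_ratio))
        # Independent random permutation per example; keep the first num_keep.
        idx = torch.rand(batch, num_patches, device=patch_tokens.device).argsort(dim=1)[:, :num_keep]
        return patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, embed_dim))

At test time the full patch sequence would typically be used, so the dropping acts purely as a training-time regularizer and efficiency measure.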
In summary, the models and approaches proposed in this thesis advance the state of the art in audio signal processing and comprehension, as demonstrated across a wide range of tasks and applications. They are also easily adaptable to new tasks and applications.