Representation learning for audio data

Applying classical machine learning methods to complex data, such as audio of human speech, often requires extensive feature engineering: domain knowledge is needed to identify and extract the useful parts of the signal.

For example, in a speaker classification task, you might hand-engineer spectral features such as mel-frequency cepstral coefficients (MFCCs), but these aren’t robust across different microphones or environmental conditions. With deep learning, models can instead learn representations directly from the data, which reduces manual feature engineering. However, downstream performance still depends on the quality of those representations.
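To make the hand-engineering step concrete, here is a minimal sketch of one classic spectral feature, the spectral centroid (the frequency-weighted mean of the magnitude spectrum). The signal, sample rate, and function name are illustrative, not taken from the talk:

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Frequency-weighted mean of the magnitude spectrum: a classic
    hand-engineered audio feature correlated with perceived brightness."""
    window = np.hanning(len(signal))                 # taper to reduce spectral leakage
    magnitudes = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

# Synthetic "recording": one second of a 440 Hz tone sampled at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
centroid = spectral_centroid(tone, sr)               # lands near 440 Hz
```

Features like this are cheap to compute, but, as noted above, they shift with the microphone and room, which is precisely what motivates learning representations instead.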

In this talk with my team at Faculty, I discuss representation learning for speaker classification. We cover the basics of representation learning and variational autoencoders, then walk through an architecture that uses labelled data to learn representations well-suited for speaker classification.
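As background for the talk, the core of a variational autoencoder can be summarised in a few lines: the encoder outputs a mean and log-variance, a latent code is sampled via the reparameterisation trick, and the loss includes a KL term pulling the latent distribution towards a standard normal. A minimal numpy sketch of just those two pieces (the encoder and decoder networks are omitted; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as mu + sigma * eps, which keeps the
    sampling step differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    summed over the latent dimensions."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

mu, log_var = np.zeros(8), np.zeros(8)   # hypothetical encoder outputs
z = reparameterize(mu, log_var)          # latent sample, shape (8,)
kl = kl_to_standard_normal(mu, log_var)  # zero: already standard normal
```

In the architecture discussed in the talk, the labelled speaker data additionally shapes this latent space so that codes from the same speaker cluster together.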