Representation learning for audio data

Applying classical machine learning methods to complex data, such as audio of human speech, often requires extensive feature engineering: domain knowledge is needed to identify and extract the useful parts of the signal.

For example, in a speaker classification task, you might hand-engineer spectral features such as mel-frequency cepstral coefficients (MFCCs), but these aren’t robust across different microphones or environmental conditions. With deep learning, models can instead learn representations directly from the data, which reduces manual feature engineering. However, downstream performance still depends on the quality of those representations.
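To make the hand-engineering step concrete, here is a minimal sketch of one classic spectral feature, the spectral centroid (the frequency-weighted mean of the magnitude spectrum). The signal, sample rate, and function name are illustrative, not taken from the talk:

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Frequency-weighted mean of the magnitude spectrum: a classic
    hand-engineered audio feature correlated with perceived brightness."""
    window = np.hanning(len(signal))                 # taper to reduce spectral leakage
    magnitudes = np.abs(np.fft.rfft(signal * window))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

# Synthetic "recording": one second of a 440 Hz tone sampled at 16 kHz.
sr = 16_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
centroid = spectral_centroid(tone, sr)               # lands near 440 Hz
```

Features like this are cheap to compute, but, as noted above, they shift with the microphone and room, which is precisely what motivates learning representations instead.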

In this talk with my team at Faculty, I discuss representation learning for speaker classification. We cover the basics of representation learning and variational autoencoders, then walk through an architecture that uses labelled data to learn representations well-suited for speaker classification.
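As background for the talk, the core of a variational autoencoder can be summarised in a few lines: the encoder outputs a mean and log-variance, a latent code is sampled via the reparameterisation trick, and the loss includes a KL term pulling the latent distribution towards a standard normal. A minimal numpy sketch of just those two pieces (the encoder and decoder networks are omitted; shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as mu + sigma * eps, which keeps the
    sampling step differentiable with respect to mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    summed over the latent dimensions."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

mu, log_var = np.zeros(8), np.zeros(8)   # hypothetical encoder outputs
z = reparameterize(mu, log_var)          # latent sample, shape (8,)
kl = kl_to_standard_normal(mu, log_var)  # zero: already standard normal
```

In the architecture discussed in the talk, the labelled speaker data additionally shapes this latent space so that codes from the same speaker cluster together.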