Visibility and monitoring in deployed ML systems

When deploying machine learning models into production, training the model is only the first step. Production deployments bring the challenge of monitoring and understanding the behaviour of live ML systems: ML-specific problems, such as population drift, are much harder to detect than conventional system problems, such as disk exhaustion or elevated error rates.

In this talk with my colleague Víctor, I discuss what we learned while building tools and workflows for monitoring ML systems. We cover how software monitoring changes for ML, what to monitor and how to monitor it (including population drift, domain shift, and historical backtests), and why machine learning engineers need to be involved in the monitoring process.
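To make the population-drift idea concrete, here is a minimal sketch of one common approach: comparing the empirical distribution of a feature in a live window against a reference (training) window using a two-sample Kolmogorov–Smirnov statistic. This is an illustrative example, not the tooling from the talk; the variable names, window sizes, and the synthetic data are all assumptions.

```python
import random

random.seed(0)
# Hypothetical data: a feature as seen at training time vs. in a live window
# whose distribution has shifted (mean moved from 0.0 to 0.7).
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]
live = [random.gauss(0.7, 1.0) for _ in range(5000)]

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past all occurrences of x in both samples, then
        # compare the empirical CDFs at that point.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

drift_score = ks_statistic(reference, live)
print(f"KS statistic: {drift_score:.3f}")  # larger gap = stronger drift signal
```

In practice you would compute this per feature on a schedule and alert when the statistic (or an associated p-value) crosses a threshold, which is itself a monitoring policy choice rather than a fixed rule.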