Visibility and monitoring in deployed ML systems

When deploying machine learning models into production, training the model is only the first step. Production deployments bring the challenge of monitoring and understanding the behaviour of live ML systems: ML-specific problems, such as population drift, are much harder to detect than conventional system problems, such as disk exhaustion or elevated error rates.

In this talk with my colleague Víctor, I discuss what we learned while building tools and workflows for monitoring ML systems. We cover how software monitoring changes for ML, what to monitor and how to monitor it (including population drift, domain shift, and historical backtests), and why machine learning engineers need to be involved in the monitoring process.
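To make the population-drift idea concrete, here is a minimal sketch of one common approach: comparing the empirical distribution of a feature in a live window against a reference (training) window using a two-sample Kolmogorov–Smirnov statistic. This is an illustrative example, not the tooling from the talk; the variable names, window sizes, and the synthetic data are all assumptions.

```python
import random

random.seed(0)
# Hypothetical data: a feature as seen at training time vs. in a live window
# whose distribution has shifted (mean moved from 0.0 to 0.7).
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]
live = [random.gauss(0.7, 1.0) for _ in range(5000)]

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past all occurrences of x in both samples, then
        # compare the empirical CDFs at that point.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

drift_score = ks_statistic(reference, live)
print(f"KS statistic: {drift_score:.3f}")  # larger gap = stronger drift signal
```

In practice you would compute this per feature on a schedule and alert when the statistic (or an associated p-value) crosses a threshold, which is itself a monitoring policy choice rather than a fixed rule.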