The machine learning lifecycle begins with data warehousing, ETL pipelining, and model training. At Algorithmia, we focus on the next stages in the lifecycle: deployment, management, and operations. Machine learning deployment plays a critical part in ensuring a model performs well, both now and in the future, but it is also vitally important to understand model monitoring and model drift to that same end.
By monitoring for model drift, you can tell whether your model is getting worse over time. For example, you can monitor whether accuracy takes a hit after you deploy a new model version.
An effective model drift monitoring process lets your critical production system safely roll out newer versions while still being able to revert to a more stable version if needed.
Many machine learning models tend to be black boxes, where explainability is very limited, which can make it difficult to understand why a model is not performing as expected. This is especially true with regard to how a model performs over time with new training data.
A model that initially worked well can later degrade because of data drift or the closely related concept drift. Data drift occurs when the underlying statistical structure of your data changes over time.
For example, let’s say that we initially had a facial recognition model that was trained on human faces not wearing glasses. After we introduce a more diverse dataset that includes faces with glasses, that underlying representative structure of the face would change. The model will not know how to classify glasses, which could affect the model’s ability to recognize faces. It could require changes to the model or changes to how the model is trained.
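One way to make "the underlying statistical structure of your data changes" measurable is to compare a feature's distribution at training time with its distribution in production. The sketch below computes a two-sample Kolmogorov-Smirnov statistic in plain Python. The function name `ks_statistic`, the sample values, and the `DRIFT_CUTOFF` value are illustrative assumptions, not part of any Algorithmia API, and a real check would likely run per feature on much larger samples.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0.0 to 1.0)."""
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The gap between two step functions peaks at one of the observed values.
    for x in ref + cur:
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


# Hypothetical feature values: a training sample and two production samples.
training = [0.10, 0.20, 0.30, 0.40, 0.50]
production_similar = [0.15, 0.25, 0.35, 0.45, 0.55]
production_shifted = [0.60, 0.70, 0.80, 0.90, 1.00]

DRIFT_CUTOFF = 0.5  # assumed value; tune per feature and sample size

for name, sample in [("similar", production_similar), ("shifted", production_shifted)]:
    drifted = ks_statistic(training, sample) > DRIFT_CUTOFF
    print(name, "drifted" if drifted else "ok")
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap, which is the situation the glasses example describes.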
How to remedy model drift
Usually, model drift is first observed by end users of the product. For example, you might notice that the speech recognition of your smart speaker degrades over time, and by the time the company’s engineers learn about it, reverting to an older version could be non-trivial.
One reason is that every model has a finite amount of expressive power (i.e., learning capability). Certain machine learning architectures can learn and generalize certain kinds of data structures better than others. For instance, in speech recognition, if you add more languages to your model without increasing its expressive power, or don’t realize that the underlying statistical structure of your data has changed, you can end up with a far worse performing model than before.
And because rolling back could cause you to also lose access to newer features of the model, like added language support, product decisions might block you from rolling back to an older, more stable model version.
Defining critical thresholds, detecting violations of these thresholds, and safeguarding the machine learning production system from degradation are the main goals of model monitoring.
Model monitoring is very similar to continuous integration/continuous deployment (CI/CD) in traditional software development. In CI/CD systems, you monitor the whole software development and deployment lifecycle using automated tools and alerting. The goal of model monitoring is to bring some of these well-established rules and systems into modern production systems that utilize machine learning.
Machine learning is inherently different from traditional software development systems in that:
- GPUs are a core component of machine learning. Traditional CI/CD systems aren’t designed to deal directly with GPUs and GPU-compiled code (like CUDA).
- Traditional CI/CD systems aren’t designed to run data science experiments, and running these tests involves building out extensive ETL pipelines to deal with data sources for experimentation.
- Traditional debugging tools in CI/CD systems aren’t super useful for troubleshooting issues related to model/data drift.
How to detect model drift
What are thresholds in model drift monitoring, and how can you monitor for violations of them? Here are a few examples of thresholds, each with a question it can help answer:
- Average model runtime
  - Did mixed-precision optimization actually help with performance?
- Model metrics, such as accuracy, precision, recall, and F1 score
  - Are we getting way more false positives with the updated speech recognition model?
- Data metrics, such as unbalanced classes or a data distribution that induces overfitting
  - Do we need to tweak the model architecture to adapt to the new underlying structure of the data? Or maybe we need to change how we train our model on the data.
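The thresholds listed above can be sketched as a small set of limits plus a function that reports which ones were violated in a monitoring window. Everything here is hypothetical: the `Thresholds` fields, the default numbers, and the `violations` helper are illustrative assumptions rather than recommended values or an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits; the numbers are assumptions, not recommendations."""
    max_avg_runtime_s: float = 0.5
    min_precision: float = 0.90
    min_recall: float = 0.85


def precision_recall_f1(tp, fp, fn):
    """Standard definitions computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def violations(runtimes_s, tp, fp, fn, limits=Thresholds()):
    """Return the name of every threshold the observed window violates."""
    found = []
    if runtimes_s and sum(runtimes_s) / len(runtimes_s) > limits.max_avg_runtime_s:
        found.append("avg_runtime")
    precision, recall, _ = precision_recall_f1(tp, fp, fn)
    if precision < limits.min_precision:
        found.append("precision")
    if recall < limits.min_recall:
        found.append("recall")
    return found


# A window with many false positives trips only the precision threshold.
print(violations([0.2, 0.3], tp=80, fp=20, fn=5))
```

Returning the names of the violated thresholds, rather than a single boolean, keeps the alert actionable: a runtime regression and a jump in false positives usually call for different fixes.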
When a model is deployed, we shouldn’t immediately route all traffic to the new version endpoint. Only a fraction of the traffic (~5 percent) should be routed to the new model and should be assessed using one or more of the thresholds.
If a violation of a threshold is detected, the system should flag the new model as decommissioned and route its 5 percent of requests back to the latest stable version.
If a violation isn’t detected after a specific amount of time, the system should start routing the remaining 95 percent of requests to the new model and flag it as the latest stable version.
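The gradual rollout described above can be sketched in two small functions: one that picks a version for each request, and one that decides what happens to the new version once evidence is in. The names `choose_version` and `resolve_canary`, the status strings, and the version labels are assumptions made for illustration.

```python
import random


def choose_version(stable, candidate, canary_fraction=0.05, rng=random):
    """Pick the version that should serve one request."""
    if candidate is None:
        return stable
    return candidate if rng.random() < canary_fraction else stable


def resolve_canary(stable, candidate, violation_found, window_elapsed):
    """Return (new_stable, new_candidate, status) for the current evidence."""
    if violation_found:
        # Send the canary traffic back to the latest stable version.
        return stable, None, "decommissioned"
    if window_elapsed:
        # No violation during the whole window: the candidate becomes stable.
        return candidate, None, "promoted"
    return stable, candidate, "testing"


rng = random.Random(0)  # seeded only to make the demonstration repeatable
served = [choose_version("v1", "v2", rng=rng) for _ in range(10_000)]
print(served.count("v2") / len(served))  # share of requests sent to the candidate
print(resolve_canary("v1", "v2", violation_found=False, window_elapsed=True))
```

Random per-request routing is the simplest option. If a given user should consistently see the same version during the test, hashing a user identifier into the split is a common alternative.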
This monitoring system can provide an automatic way of deploying ML models into production without causing significant model drift to appear.
It is worth mentioning that there are also other approaches to evaluating models, such as A/B testing, champion-challenger testing, etc.
Check out this blog article for more about model evaluation.
What are some possible application areas where model monitoring makes sense?
- Data drift: If you start seeing user engagement dip after a model deployment, this might be a sign of model drift.
- Model learning capacity: Specific versions of a model work great with subsets of the user population. This can happen when you hit the model’s learning capacity. A valid strategy here could be using different models for different user sub-population groups. Before doing this, being able to monitor your models for drift is key.
- Performance degrading change: Small model architectural changes can inadvertently lead to severe performance degradation. Having an easy and automated way to roll back makes your machine learning system less fragile.
- Incorrect deployments and bugs: Being able to quickly test a new model on an ad hoc basis to catch bugs can prevent silent issues from getting into production. (For instance, maybe the production servers use slightly different GPU hardware that might trigger a bug in your model and ML framework.)
- Maintenance and manual deployment cycles: Having an automated system saves time and resources for data scientists. They shouldn’t be bogged down building and debugging their machine learning infrastructure.
A closer look
Designing and implementing model monitoring can look very different depending on the platform you’re running on. Algorithmia takes a very flexible approach. Our platform is a GPU-attached serverless service, which allows you to take advantage of a serverless architecture.
What does model monitoring look like on Algorithmia? We have an orchestration algorithm that manages all of the business logic of determining if a model has drifted and making the decision to roll out a new model.
The orchestration algorithm would basically do the following:
- If no stable versions exist, make the first published version of the algorithm the stable version.
- Keep track of:
  - The stable model version
  - Deprecated model versions
  - Failed model versions
- Route all requests to the model:
  - All requests to the stable version, if there’s no test model
  - 95 percent of them to the stable version and 5 percent of them to the test model, if there is one
- Keep track of the model drift metrics:
  - These will be used to determine if the model has drifted at the end of the testing period.
- Promotion or rollback:
  - If the model hasn’t drifted, the test model will be promoted as the stable version. The previous stable model will be deprecated.
  - If model drift has occurred, the test model is tagged as failed, and all requests are routed to the stable version.
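The steps above can be sketched as a small in-memory class. This is not Algorithmia's orchestration algorithm: the `Orchestrator` class, its method names, and the version strings are hypothetical, and a production version would persist its state rather than keep it in memory.

```python
import random


class Orchestrator:
    """In-memory sketch of the bookkeeping described above (hypothetical API)."""

    def __init__(self, test_fraction=0.05, rng=None):
        self.stable = None
        self.test = None
        self.deprecated = []
        self.failed = []
        self.test_fraction = test_fraction
        self.rng = rng or random.Random()

    def publish(self, version):
        # The first published version becomes stable; later ones enter testing.
        if self.stable is None:
            self.stable = version
        elif self.test is not None:
            raise RuntimeError("a test is already running")  # simplifying assumption
        else:
            self.test = version

    def route(self):
        # Without a test model, every request goes to the stable version.
        if self.test is not None and self.rng.random() < self.test_fraction:
            return self.test
        return self.stable

    def finish_test(self, drifted):
        # Called at the end of the testing period with the drift verdict.
        if self.test is None:
            return
        if drifted:
            self.failed.append(self.test)
        else:
            self.deprecated.append(self.stable)
            self.stable = self.test
        self.test = None


orch = Orchestrator()
orch.publish("1.0.0")           # becomes the stable version
orch.publish("1.1.0")           # enters testing
orch.finish_test(drifted=True)  # tagged as failed; traffic stays on 1.0.0
print(orch.stable, orch.failed)
```

Collecting the drift metrics themselves is left out of this sketch. `finish_test` only receives the verdict, so the metric logic can change without touching the routing and bookkeeping.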
The 5 and 95 percent split here is arbitrary and can be tuned according to the use case. In terms of implementation, because all requests are made to a distributed serverless platform, a database with ACID transactions can be used to hold the orchestration state and prevent race conditions.
Note: We’ve also created an example algorithm to demonstrate a workflow similar to the one described above. You can find the source code here.
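As a rough illustration of why transactions matter here, the sketch below promotes a test version only if it is still the version under test, and does the check and the update inside a single transaction. It uses Python's built-in `sqlite3` module purely as a stand-in; a distributed platform would need a networked database, and the `routing` table, its columns, and the function names are assumptions.

```python
import sqlite3


def open_state(path=":memory:"):
    """Create the (hypothetical) single-row table that holds routing state."""
    # isolation_level=None leaves transaction control to explicit BEGIN/COMMIT.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS routing ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), stable TEXT, test TEXT)"
    )
    conn.execute("INSERT OR IGNORE INTO routing (id, stable, test) VALUES (1, NULL, NULL)")
    return conn


def promote_if_current(conn, expected_test):
    """Promote the test version only if it is still the one under test."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute("SELECT test FROM routing WHERE id = 1").fetchone()
        if expected_test is None or row[0] != expected_test:
            conn.execute("ROLLBACK")  # another worker already changed the state
            return False
        conn.execute("UPDATE routing SET stable = test, test = NULL WHERE id = 1")
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise


conn = open_state()
conn.execute("UPDATE routing SET stable = '1.0.0', test = '1.1.0' WHERE id = 1")
print(promote_if_current(conn, "1.1.0"))  # first caller performs the promotion
print(promote_if_current(conn, "1.1.0"))  # a repeated call finds nothing to promote
```

Without the transaction, two workers finishing the same test period could both read the old state and apply the promotion twice, or promote a version that another worker had already marked as failed.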
Conclusion and continued learning
We’ve touched on a number of new concepts like data drift, model drift, and model monitoring. We’ve learned how model monitoring helps safeguard ML systems from silent performance degradation over time, helps detect model learning capacity issues, and helps set up an automated deployment cycle for data scientists. We also took a closer look at how it might work in a serverless deployment environment.