Effective ML Monitoring: A Hands-on Example

NannyML’s ML monitoring workflow is an easy, repeatable and effective way to ensure your models keep performing well in production.


Introduction

The ultimate goal of using machine learning (ML) in the industry is to deliver value to your organization. As data scientists, it's a constant challenge to align our ML solutions with the company's KPIs and ensure our predictive models positively impact the business. Some might even argue that the ability to do this defines a successful data scientist.
A crucial part of achieving this success is monitoring your ML models in production to ensure they are, indeed, performing well once they are put to use. To do so, it helps to know which steps can be taken systematically to keep your models properly monitored. A workflow that is simple, repeatable, and effective lets you relax instead of worrying about something going catastrophically wrong with your models in production.
At NannyML, we have established a performance-centric ML monitoring workflow. This approach has three main steps: continuous performance monitoring, root cause analysis, and issue resolution. In this blog, we will show how to apply this workflow with a classification model and demonstrate how it can help you get your failing model back on track.

Why you should use a performance-centric monitoring workflow

When monitoring your ML model in production, a key focus is detecting declines in performance, which are almost certain to occur over time.
For supervised learning tasks, most monitoring is done by detecting drifts in the distribution of data and calculating the model's realized performance when target data is available. Detecting data drift involves monitoring covariate shift, concept drift, and label shift, all of which are covered in many NannyML blogs. Monitoring realized performance is achieved by calculating metrics such as F1 score, Recall, RMSE, and MAE, depending on whether we are dealing with a classification or regression problem.
Covariate shift is a type of data drift that refers to changes in the distribution of the model’s input features, represented as changes in P(X), while the relationship between the input features X and the target variable Y (i.e., P(Y|X)) remains unchanged.
Although it is important to take the above-described steps, this approach has its flaws. Firstly, while distribution shifts are often the cause of performance drops, they are inadequate for detecting these drops. Covariate shift can occur without necessarily affecting the model's performance. Therefore, detecting covariate shift is more suitable for root cause analysis, which helps understand the reason for the observed drop in performance.
Secondly, while it is effective to establish a performance drop by continuously calculating evaluation metrics, this requires access to ground truth (or target data). Unfortunately, in most production environments, this data is either unavailable or becomes available only after the model's predictions have already impacted the business.
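To make this concrete, here is a minimal synthetic sketch (not from the original post) of a covariate shift that leaves performance untouched: the input distribution moves, but because P(Y|X) is fixed and the shifted data lands in a region the model already handles well, accuracy stays high.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Fixed relationship P(Y|X): the label depends only on the sign of x
def label(x):
    return (x[:, 0] > 0).astype(int)

# Training data: X ~ N(0, 1)
X_train = rng.normal(loc=0.0, scale=1.0, size=(5_000, 1))
model = LogisticRegression().fit(X_train, label(X_train))

# Production data: P(X) shifts to N(2, 1) -- covariate shift,
# but P(Y|X) is unchanged
X_prod = rng.normal(loc=2.0, scale=1.0, size=(5_000, 1))

print("train accuracy:", accuracy_score(label(X_train), model.predict(X_train)))
print("prod accuracy: ", accuracy_score(label(X_prod), model.predict(X_prod)))
# The shifted inputs fall in a region the model already classifies well,
# so accuracy stays high despite the clear drift in P(X).
```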
The proposed workflow addresses the problems mentioned above. This process involves three main steps:
  1. Performance monitoring is the first step of the workflow and must be carried out continuously. This step calculates realized performance when ground truth data is available and estimates performance when it is not.
  2. Automated root cause analysis is conducted once a drop in performance is observed. This is achieved by detecting covariate shifts and leveraging domain expertise.
  3. Finally, depending on the cause of the performance drop, we choose the appropriate method to resolve the issue.
Santi has already written a comprehensive blog post outlining this workflow, which you can refer to for an in-depth explanation. Here, we will jump into a practical example and demonstrate how to apply this workflow using a real ML model.

Applying the monitoring workflow

We will demonstrate how to follow the workflow for a model predicting hotel booking cancellations. We will do this by implementing it in NannyML Cloud. While it is also possible to fully implement this workflow using NannyML OSS, that would take considerably more time to orchestrate in a real-world production environment.
We use the predictions from a simple LightGBM classifier model, which achieves an AUROC of 0.87 on the test set.
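The training code itself is not the focus of this post, but a classifier along these lines could be produced with a short script like the one below. The file name, feature list, and is_canceled target column are assumptions for illustration; swap in your own data.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical file and column names for the hotel booking cancellation use case
df = pd.read_csv("hotel_bookings.csv")
features = ["hotel", "lead_time", "parking_spaces", "country"]
target = "is_canceled"

# LightGBM can consume pandas categorical columns directly
for col in ["hotel", "country"]:
    df[col] = df[col].astype("category")

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=42
)

model = lgb.LGBMClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

y_proba = model.predict_proba(X_test)[:, 1]
print("test AUROC:", roc_auc_score(y_test, y_proba))
```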
The only prerequisite for this tutorial is that you have a running instance of NannyML Cloud. NannyML Cloud can be installed as a managed application on Azure or as a Helm Chart application on AWS. These options ensure that your data never leaves your cloud environment, making them excellent choices. Alternatively, NannyML Cloud can also be set up as a SaaS. For more information on getting set up with NannyML Cloud, please refer to our documentation.
The datasets used to monitor the model should be split into two sets: the reference data and the monitored data. The reference data comes from a period during which we have established that our model behaves acceptably. It serves as a benchmark for evaluating the model's performance. Typically, this dataset is the test set used during model development or the most recent production data where the model performed according to expectations.
The monitored data, on the other hand, is the data from the period we are actively seeking to monitor. In practice, this data is continuously collected and used to make predictions. To learn how to build an automated data collection pipeline for monitoring, refer to our blog on the topic — it's very simple.
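A minimal sketch of such a split, assuming a prediction log with a timestamp column and an arbitrary cutoff date:

```python
import pandas as pd

# Assumed: a log with a timestamp column plus features, predictions
# and (where already available) targets
df = pd.read_csv("hotel_predictions_log.csv", parse_dates=["timestamp"])

# Reference period: data where the model is known to perform acceptably.
# Monitored (analysis) period: everything collected after that point.
cutoff = pd.Timestamp("2016-08-01")  # illustrative cutoff
reference_df = df[df["timestamp"] < cutoff]
analysis_df = df[df["timestamp"] >= cutoff]
```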

Performance monitoring

The first step of the workflow is performance monitoring. This step needs to occur continuously as new data is ingested into our model in production to make predictions. Depending on your specific application, you will want to perform this step on a schedule that best suits your needs, whether that be hourly, daily, weekly, or at another interval.
The way you monitor the performance of your model depends on whether or not target data is available. If target data is available, you can directly calculate any relevant performance metric, such as RMSE, MAE, AUROC, or Accuracy. However, in most cases, targets are not available for your model in production until after the model’s predictions have already impacted business decisions.
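When targets do become available, realized performance can be computed with the open-source nannyml package. The sketch below assumes the column names and monthly chunking used throughout this example:

```python
import nannyml as nml

# Realized performance: only possible once targets ("is_canceled") arrive
calc = nml.PerformanceCalculator(
    problem_type="classification_binary",
    y_true="is_canceled",
    y_pred="y_pred",
    y_pred_proba="y_pred_proba",
    timestamp_column_name="timestamp",
    metrics=["roc_auc", "accuracy"],
    chunk_period="M",  # one chunk per month
)
calc.fit(reference_df)
results = calc.calculate(analysis_df)
results.plot().show()
```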
When targets are not yet available, we can rely on performance estimation methods, which, as their name suggests, estimate the aforementioned performance metrics. Methods for performance estimation include DLE, CBPE, and PAPE, all developed by NannyML. To learn more about these methods, refer to our documentation.
Performance estimation using PAPE
When monitoring performance, whether it is realized or estimated, you need to set a threshold for what is considered acceptable performance. If a metric indicates a value outside this predetermined threshold, an alert is raised. In the image below, the accuracy of the model is estimated using PAPE. Monitoring runs are conducted monthly, and we observe that starting in December 2016, the estimated performance falls below the threshold for acceptable performance, as indicated by the red diamonds. At this point, it’s time to move on to the second part of the workflow: performing root cause analysis.
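The plot above comes from NannyML Cloud, where PAPE is available. As a rough open-source counterpart, CBPE produces an analogous estimate; the sketch below assumes the same column names, and the constant accuracy threshold of 0.75 is purely illustrative (by default, thresholds are derived from the reference data):

```python
import nannyml as nml
from nannyml.thresholds import ConstantThreshold

# Estimated accuracy without targets, using CBPE (the OSS counterpart of the
# PAPE estimate shown above). The 0.75 lower bound is an assumed example value.
estimator = nml.CBPE(
    problem_type="classification_binary",
    y_true="is_canceled",
    y_pred="y_pred",
    y_pred_proba="y_pred_proba",
    timestamp_column_name="timestamp",
    metrics=["accuracy"],
    chunk_period="M",
    thresholds={"accuracy": ConstantThreshold(lower=0.75)},
)
estimator.fit(reference_df)              # fit on the reference period
estimated = estimator.estimate(analysis_df)
estimated.plot().show()                  # alerts mark chunks below the threshold
```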

Automated Root Cause Analysis

Many data scientists use covariate shift as a means to detect performance drops. However, in some cases, covariate shift will not negatively impact the performance of your model. Consequently, relying on covariate shift as the primary monitoring technique may lead to many false alerts, resulting in alert fatigue. Therefore, in this workflow, we instead use covariate shift as a tool to understand why your model’s performance has declined.
On NannyML Cloud, root cause analysis is carried out automatically. We offer a range of covariate shift detection methods you can choose to monitor, including both univariate and multivariate approaches. As their names imply, univariate methods track drift in a single input feature, while multivariate methods track drift across several input features at once. Deciding which method is right for you depends on your specific use case and the type of features you are working with. Fortunately, we have a blog that guides you in making the right decision.
First, we check for multivariate drift. As shown in the image below, using the PCA reconstruction method for multivariate detection, we observe that the occurrence of drift corresponds with the drop in performance. This indicates that the performance decline is related to changes in the data structure over time.
Multivariate drift is detected using the PCA reconstruction method.
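In the open-source library, the same check can be reproduced with the PCA-based reconstruction drift calculator; the feature list below is assumed from this example:

```python
import nannyml as nml

feature_columns = ["hotel", "lead_time", "parking_spaces", "country"]  # assumed

# PCA-based data reconstruction drift: a rising reconstruction error signals
# that the overall structure of the input data is changing
mv_calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_columns,
    timestamp_column_name="timestamp",
    chunk_period="M",
)
mv_calc.fit(reference_df)
mv_results = mv_calc.calculate(analysis_df)
mv_results.plot().show()
```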
Next, we examine distribution shifts of individual features. We recommend starting with features that have a significant impact on the model's performance. In this example, we will investigate whether covariate shift occurs in the following features: hotel, lead_time, parking_spaces, and country. For categorical features, we use the L-infinity method, and for continuous features, we use the Wasserstein method.
Drift detection for the following features: country (top left), hotel (top right), lead_time (bottom left), parking_spaces (bottom right)
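A sketch of the equivalent univariate check with the open-source library, assuming country and hotel are treated as categorical and lead_time and parking_spaces as continuous:

```python
import nannyml as nml

uv_calc = nml.UnivariateDriftCalculator(
    column_names=["country", "hotel", "lead_time", "parking_spaces"],
    timestamp_column_name="timestamp",
    categorical_methods=["l_infinity"],   # applied to country, hotel
    continuous_methods=["wasserstein"],   # applied to lead_time, parking_spaces
    chunk_period="M",
)
uv_calc.fit(reference_df)
uv_results = uv_calc.calculate(analysis_df)

# Inspect the drift metric and the shifting distribution for a single feature
uv_results.filter(column_names=["lead_time"]).plot(kind="drift").show()
uv_results.filter(column_names=["lead_time"]).plot(kind="distribution").show()
```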
We observe that the drift detected in the lead_time feature corresponds strongly with the performance drop. Additionally, the drifts identified for the country and hotel features also somewhat align with the performance decline.
Domain expertise can help us better understand why these features might have drifted. For example, shifts in the distribution of country could occur because different countries book vacations at varying times, often corresponding with official holidays and seasons. The distribution of hotel might shift as people visit different regions more frequently at certain times of the year. Similarly, lead_time could vary as people book their holidays further in advance at different times of the year. NannyML Cloud also allows users to check summary statistics, such as average, median, and standard deviation, for the model's features, which can further help interpret changes in the data.
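If you want a quick, library-agnostic view of those summary statistics, a plain pandas aggregation over monthly chunks gets you most of the way (building on the analysis_df frame from the earlier split):

```python
# Monthly mean, median and standard deviation of lead_time in the monitored period
monthly_stats = (
    analysis_df
    .set_index("timestamp")
    .resample("M")["lead_time"]
    .agg(["mean", "median", "std"])
)
print(monthly_stats)
```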

Issue resolution

Depending on the cause of the model degradation, we recommend three courses of action to get your model back on track. Firstly, if the cause of the model degradation is concept drift, retraining the model is the preferred solution. Unfortunately, detecting concept drift requires access to target data, which isn't available in our current example. However, if you do have access to target data, the RCD algorithm developed by NannyML can be a valuable tool for concept drift detection.
If the performance drop is related to data quality, we recommend addressing this issue upstream. This could be due to broken data pipelines, data becoming unavailable because of changes in data collection processes, or various other reasons. It is important to note that you can also monitor data quality metrics, such as the presence of missing data and the occurrence of unseen data.
NannyML defines unseen data as a categorical value encountered in production that was not present during training.
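Both data quality checks are also available in the open-source library; here is a sketch using the feature names assumed throughout this example:

```python
import nannyml as nml

# Missing values per feature, per monthly chunk
missing_calc = nml.MissingValuesCalculator(
    column_names=["country", "hotel", "lead_time", "parking_spaces"],
    timestamp_column_name="timestamp",
    chunk_period="M",
)
missing_calc.fit(reference_df)
missing_calc.calculate(analysis_df).plot().show()

# Unseen values: categorical levels in production that never appeared in reference
unseen_calc = nml.UnseenValuesCalculator(
    column_names=["country", "hotel"],
    timestamp_column_name="timestamp",
    chunk_period="M",
)
unseen_calc.fit(reference_df)
unseen_calc.calculate(analysis_df).plot().show()
```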
If the performance drop is due to covariate shift, we recommend adjusting the prediction threshold based on the needs of your particular model. For example, raising the decision threshold would result in an increased number of false negatives, meaning we would miss predicting actual cancellations. This could lead to an opportunity cost as we might hold rooms for guests who will eventually cancel.
On the other hand, lowering the decision threshold would result in an increase in false positives, leading to more non-cancellations being predicted as cancellations. This could result in overbooking. To find the optimal threshold, you must balance these consequences and make a decision accordingly.
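As a rough illustration of that balancing act, you can score a handful of candidate thresholds against hypothetical per-error costs on labeled data such as the reference period. The cost figures and column names below are assumptions, not values from this post:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical business costs: a missed cancellation (false negative) means a room
# held for a no-show; a false alarm (false positive) risks overbooking.
COST_FN, COST_FP = 100.0, 40.0

y_true = reference_df["is_canceled"].to_numpy()
y_proba = reference_df["y_pred_proba"].to_numpy()

for threshold in np.arange(0.3, 0.8, 0.1):
    y_pred = (y_proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    cost = fn * COST_FN + fp * COST_FP
    print(f"threshold={threshold:.1f}  FP={fp:5d}  FN={fn:5d}  expected cost={cost:,.0f}")
```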

Conclusion

As we have just seen, monitoring your ML model is crucial if you want your model to perform well when its predictions have a real impact on business decisions. The performance-centric workflow outlined in this blog demonstrates how we can establish an easy, repeatable, and above all, effective way of monitoring our models in production.
If you are interested in finding out how NannyML Cloud can help your company get the most out of your ML models, book a demo with one of our founders! NannyML Cloud offers 24/7 automated monitoring functionalities. It includes various covariate shift methods, performance estimation techniques, tools for measuring the business impact of a model, methods for measuring concept drift, and much more.

More on post-deployment data science

Our blog contains many interesting articles to help you better navigate the fast-growing field of post-deployment data science. From research pieces to tutorials for interesting use cases, we’ve got you covered. If you enjoyed this article, head over to our blog for more suggested readings.


Written by

Miles Weberman

Data Science Writer at NannyML