Table of Contents
- Typical ML Monitoring Frustration
- Current Bottleneck of the Traditional ML Monitoring Workflow
- The New ML Monitoring Workflow
- So, what's the payoff?
- Take One Classification Task as an Example
- Stage 1: Estimated Performance Monitoring
- Stage 2: Automatic Root Cause Analysis
- Multivariate Drift Detection 🔗
- Univariate Drift Detection 🔗
- Stage 3: Issue Resolution
- Closing Thoughts
Typical ML Monitoring Frustration
You’ve trained and deployed your machine learning model, and now it’s running in production. For a few months, everything seems smooth—your model’s predictions are accurate, and the results align with expectations. But then, one evening, you get an urgent call from the sales team: the numbers are completely off. The model has failed, and no one saw it coming.
What went wrong? Despite countless hours spent monitoring the model’s performance and sifting through false alarms triggered by data drift, silent failures still slip through, undetected until the damage is done.
Recent studies show that 91% of ML models fail silently over time. Behind these silent failures lies an overlooked issue known as temporal model degradation—in other words, machine learning models “age” as time progresses.
Yet the models themselves are not aging. The real culprit is usually data drift: the statistical properties of the incoming data change because the upstream data-producing environment shifts. These shifts can stem from various issues, such as sensor malfunctions, changes in customer behavior, policy updates, or other external factors. Eventually, the model becomes misaligned with its new reality and fails to perform as expected.
In an ideal scenario, data scientists are alerted when a model’s performance changes and use data drift detection methods to identify the specific features causing the problem. However, calculating model performance is not always possible, and pinpointing the exact cause of data drift can be challenging and sometimes frustrating.
Current Bottleneck of the Traditional ML Monitoring Workflow
The traditional machine learning monitoring workflow relies heavily on ground truth labels to evaluate model performance. In extreme situations, when labels are difficult to obtain, companies rely solely on data drift detection. This reactive approach presents two significant challenges:
First, changes in data distribution don’t always correspond to a decline in model performance. For example, drift in a less critical feature might not impact the model’s accuracy but could trigger a flood of false alarms, diverting attention to non-issues and wasting valuable resources.
Second, waiting for true labels creates a significant delay in identifying and addressing performance issues. This lag not only slows down decision-making but also hinders effective communication across teams. By the time the problem is fully understood, it might be too late for quick intervention, potentially resulting in substantial financial losses.
In summary, these shortcomings render traditional monitoring workflows inefficient and overly reactive. A more proactive approach is needed—one that can quickly detect performance issues without relying exclusively on ground truth labels or generating excessive false positives.
The New ML Monitoring Workflow
To overcome the limitations of traditional monitoring, we advocate a new three-stage ML monitoring workflow designed for greater efficiency and proactivity. This workflow, illustrated below, redefines the way we monitor, diagnose, and resolve model performance issues.
Here is a brief overview of each stage:
- Performance Monitoring 👑: Unlike the traditional ML monitoring workflow, we don’t wait for ground truth labels to assess model performance. Instead, we leverage novel probabilistic algorithms to estimate performance in near real-time, eliminating the delays inherent in traditional methods and enabling us to detect issues as soon as they arise.
- Root Cause Analysis: When a performance change is detected, data drift analysis is used to pinpoint its root cause. Both multivariate and univariate drift analyses can be conducted, depending on the situation. By visualizing changes in feature distributions and analyzing timestamps associated with performance change, this stage isolates the specific features driving the problem.
- Issue Resolution: After identifying the root cause, the next step is to address it. Solutions may vary depending on the production environment, but common approaches include retraining the model with newly collected data to restore performance, readjusting prediction thresholds, resolving downstream issues, and improving upstream data quality.
So, what's the payoff?
- Fewer False Alarms: This workflow minimizes false data drift alarms by estimating the model’s performance in near real-time, enabling us to focus on the actual issues.
- Timely Reporting: Performance concerns are flagged earlier, enabling swift action to mitigate potential impacts.
- Continuous Performance Monitoring: Your model's performance is monitored 24-7, providing a consistent safety net against silent failures.
Take One Classification Task as an Example
Let’s dive into a hands-on demonstration using the NannyML library on Google Colab. You can access this notebook to follow along with the code as you explore this section.
For this example, we use our library's built-in Car Loan dataset. The dataset describes a machine learning model that predicts whether a customer will repay a loan taken out to buy a car. It has been intentionally altered to simulate real-world challenges: some variables drift after deployment and cause the model's performance to drop during the post-deployment stage.
The task is to leverage NannyML’s workflow to estimate the model’s performance after deployment. If a performance drop is detected, the workflow automatically flags high-risk variables using our built-in data drift detection methods.
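As a setup sketch (assuming a recent NannyML release; the loader below matches the library's built-in datasets, but exact return values may differ between versions), the data can be loaded like this:

```python
import nannyml as nml

# Load the built-in synthetic Car Loan dataset. It comes pre-split into a
# reference period (labels available, performance known to be acceptable)
# and an analysis period (post-deployment data to be monitored).
reference_df, analysis_df, analysis_targets_df = nml.load_synthetic_car_loan_dataset()

print(reference_df.head())
print(analysis_df.head())
```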
With everything set up, let’s see how this new ML Monitoring Workflow performs in action.
Stage 1: Estimated Performance Monitoring
This stage is the innovative core of our workflow, where the real transformation happens. Here, the focus is on estimating the model’s performance during the post-deployment period. To do so, three algorithms are available:
- DLE (Direct Loss Estimation): Designed for regression tasks. It estimates a regression model’s performance by estimating the loss of each observation and converting those estimated losses into performance metrics.
- CBPE (Confidence-based Performance Estimation): Designed for classification tasks. It exploits the fact that classification models return predictions with an associated confidence score, and uses those scores to estimate performance metrics.
- PAPE (Probabilistic Adaptive Performance Estimation): CBPE falls short under strong covariate shift, which can significantly degrade the quality of its calibration. To address this, we developed PAPE. It calculates the ratio of probability density functions between the analysis and reference datasets and uses that ratio to perform a weighted calibration on the reference data, so that the calibration reflects the uncertainty present in the analysis data (see the sketch below).
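In rough terms (this is our shorthand, not the formal derivation from the PAPE paper), the weight applied to each reference observation during calibration is a density ratio between the two periods:

$$
w(x) = \frac{p_{\text{analysis}}(x)}{p_{\text{reference}}(x)}
$$

Fitting the calibrator on reference data weighted by $w(x)$ makes the calibrated confidence scores reflect the feature distribution actually seen in the analysis period.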
As our task is binary classification, we use the CBPE algorithm for performance estimation. To fit the estimator, we need a reference period and an analysis period.
The reference period establishes the baseline the estimator learns from: a period during which the feature-label pairs are stable and the model's accuracy is known to be acceptable.
The analysis period is then where the algorithm estimates the overall performance of the monitored model, leveraging what it learned from the reference period.
At this stage, the workflow functions like a bad-weather warning system ☔️, triggering alerts whenever a change in the estimated model performance is detected.
For evaluation, we select two metrics: the F1 score and the ROC AUC score (Fig 1).
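A minimal sketch of this stage with the CBPE estimator, continuing from the loading snippet above (column names follow the built-in Car Loan dataset; the chunk size is an illustrative choice):

```python
# Fit the estimator on the reference period, where labels and performance are known.
estimator = nml.CBPE(
    y_pred_proba='y_pred_proba',        # predicted probability column
    y_pred='y_pred',                    # predicted class column
    y_true='repaid',                    # target column (only needed for reference data)
    timestamp_column_name='timestamp',
    metrics=['f1', 'roc_auc'],
    chunk_size=5000,                    # illustrative chunking of the data stream
    problem_type='classification_binary',
)
estimator.fit(reference_df)

# Estimate performance on the analysis period, where no labels are available yet.
estimated_results = estimator.estimate(analysis_df)
estimated_results.plot().show()         # interactive plot along the lines of Fig 1
```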
It is important to recognize that the estimates can differ across metrics, so the metric you prioritize should reflect your business needs. In this example, the estimation suggests a drop in the F1 score, while the changes in the ROC AUC score do not raise any alarm. If we value the F1 score more than the ROC AUC score, we should pay attention to the model's degradation (indicated by the red diamond alert icons) occurring from late March onward.
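If a tabular view is more convenient than the plot, the estimates and alert flags can also be pulled into a dataframe (the exact multi-level column layout may vary across NannyML versions):

```python
# Inspect the estimated F1 values and their alert flags for the analysis period.
f1_df = estimated_results.filter(period='analysis', metrics=['f1']).to_df()
print(f1_df)
```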
Stage 2: Automatic Root Cause Analysis
Building on the initial performance monitoring results, we now need to dig deeper into the root cause of the F1 score drop. For this stage, we apply two detection methods: multivariate drift detection and univariate drift detection.
Multivariate Drift Detection 🔗
Multivariate drift detection provides a single summary metric across all features, reducing the risk of false alerts and detecting more subtle changes.
For this step, we use Data Reconstruction with PCA and track the reconstruction error, computed by averaging the Euclidean distance between the original data points and their PCA reconstructions. Because PCA learns the internal structure of the data, a significant change in the reconstruction error means that the learned structure no longer accurately approximates the current data.
In short, the reconstruction error is calculated over time for the monitored model, and an alert is raised whenever it moves outside a range defined by its variance during the reference period.
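A sketch of this step; the two features named later in this post are real column names, while the rest of the feature list is our assumption about the dataset's columns:

```python
# Feature columns to monitor (names other than car_value and
# repaid_loan_on_prev_car are assumed here for illustration).
feature_column_names = [
    'car_value', 'salary_range', 'debt_to_income_ratio', 'loan_length',
    'repaid_loan_on_prev_car', 'size_of_downpayment', 'driver_tenure',
]

# PCA-based multivariate drift detection: fit on reference, score the analysis period.
mv_calculator = nml.DataReconstructionDriftCalculator(
    column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=5000,
)
mv_calculator.fit(reference_df)
multivariate_results = mv_calculator.calculate(analysis_df)
multivariate_results.plot().show()      # reconstruction error over time, as in Fig 2
```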
After a quick automated analysis (Fig 2), we observed a sharp increase in reconstruction error around mid-February. This highlights a drift in the underlying data distribution, corresponding to the period when the model’s performance starts to change. The red diamond markers flag moments when the error surpasses predefined thresholds.
To understand the actual cause in more depth, we bring the univariate analysis from our toolbox into play, focusing on changes within individual features that broader analyses might miss. 🔍
Univariate Drift Detection 🔗
Univariate drift detection looks at each feature individually to determine whether its distribution has changed compared to the reference data. By default, the built-in univariate analysis in our library uses the Jensen-Shannon distance, a measure of divergence between probability distributions, to detect changes over time. The analysis reveals that multiple features drifted, which led to the change in the model's performance.
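A sketch of the univariate step, reusing the same feature list and Jensen-Shannon as the drift method for both continuous and categorical columns:

```python
# Univariate drift detection with the Jensen-Shannon distance.
uv_calculator = nml.UnivariateDriftCalculator(
    column_names=feature_column_names,
    timestamp_column_name='timestamp',
    continuous_methods=['jensen_shannon'],
    categorical_methods=['jensen_shannon'],
    chunk_size=5000,
)
uv_calculator.fit(reference_df)
univariate_results = uv_calculator.calculate(analysis_df)
```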
While we ran the univariate analysis on all features, to keep the demonstration simple we focus on two drifted features: car_value (continuous) and repaid_loan_on_prev_car (categorical).
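The distribution views discussed next can be produced by filtering the univariate results down to those two columns; `kind='distribution'` is the option used here, while the default drift plot would show the Jensen-Shannon values instead:

```python
# Plot the per-chunk distributions of the two drifted features (Fig 3 and Fig 4).
univariate_results.filter(
    column_names=['car_value', 'repaid_loan_on_prev_car'],
    methods=['jensen_shannon'],
).plot(kind='distribution').show()
```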
Univariate Analysis - Car Value:
The distribution plots (Fig 3) provide a daily view of the distribution of car_value, with each curve representing the feature's distribution on a specific day. Blue and purple represent the reference and analysis periods, whereas red highlights the periods flagged during the performance monitoring stage as potential data shifts. From mid-February onward, we can observe that overall car values are rising. A significant data drift is therefore evident: the quantiles shift noticeably, impacting the model’s predictions and contributing to its degradation.
Univariate Analysis - Repaid Loan on Previous Car:
Figure 4 below shows how the distribution of the categorical feature changes over time. Notably, in mid-February there is a significant deviation, highlighted by the red bars, compared to the earlier distribution. During this alert period, we observe a noticeable increase in the number of borrowers who paid off their previous loans relative to the baseline, indicating data drift.
The Possible Story Behind These Shifts:
By analyzing the other plots in the notebook, we observe similar distributional changes in several other features, all occurring in mid-February. This suggests a common underlying factor driving changes in multiple features at once.
These shifts may reflect real-world trends such as economic growth, reduced interest rates, or rising salaries, which ripple through other key feature distributions and ultimately contribute to the model's performance degradation.
Stage 3: Issue Resolution
The observed increase in car prices and in the number of borrowers who paid off their previous loans suggests that the company may need to consider a strategic adjustment.
Coordination and communication with relevant departments are essential to confirm the underlying causes. The good news is that with the deployment of the ML monitoring workflow, we now have more time to thoroughly investigate these drifts and gather the evidence needed to support informed decisions. If the changes are indeed driven by the factors listed above, the model might need to be retrained on data reflecting the updated distribution to ensure it continues to perform effectively.
Additionally, after retraining, we need to validate the model's performance to confirm its alignment with the new data and establish continuous monitoring to promptly detect any future shifts. By maintaining thorough documentation and fostering ongoing collaboration with stakeholders, we can ensure our machine-learning system remains robust and reliable in a dynamic business environment.
Closing Thoughts
Monitoring machine learning models in production can be a challenging and frustrating task. Many data scientists have faced the same issues: worrying about model performance while waiting for actual labels, or spending countless hours investigating false alarms triggered by unimportant red flags. Ultimately, we all strive for the same goal: machine learning models that are both accurate and robust.
That’s why we advocate for a performance-centric monitoring workflow—a monitoring workflow that is reliable, efficient, and proactive. By adopting this workflow, we can focus our time and effort on addressing the issues that truly matter, ensuring better outcomes for our models and their applications.