Introduction
Businesses depend on machine learning models to predict future demand. But these models aren't set-and-forget solutions.
Once deployed, they must be continuously monitored. Why? Because the world is constantly changing. Trends shift. Consumer behaviours evolve. External factors can disrupt even the best predictions.
This brings us to the concept of post-deployment data science. It's the branch of data science that focuses on monitoring and maintaining machine learning models after they have been put into production. The goal is to ensure that the model continues to perform as expected and to catch potential failures before they impact the business.
In this blog, I explore how demand forecasting models can fail after deployment and share some handy tricks to correct or contain these issues. With the right approach, you can keep your forecasts reliable and your business running smoothly.
What Makes Demand Forecasting Difficult To Monitor?
At NannyML, we believe that effective monitoring is proactive, not reactive. Waiting for a problem to become evident can lead to significant losses: missed sales, overstock, and inefficient supply chains. Proactive monitoring means catching potential issues early, before they have a chance to disrupt operations.
For more insights on how to implement a proactive monitoring workflow, check out this blog written by Santiago
One of the core challenges in monitoring demand forecasting models lies in the fundamental nature of time series data. Time series models draw on patterns from the past to forecast the future. They depend on the idea that historical trends, seasonality, and relationships between variables will continue as they have before. Yet, the world around us is full of surprises. Economic shifts, market changes, and evolving consumer behaviours can all disrupt these patterns, making the future less predictable than we might hope.
When this happens, the model may start to drift.
A machine learning model experiences drift when the statistical properties of its input variables, or the relationship between those inputs and the target variable, change, even if only slightly.
Over time, these small inaccuracies can accumulate, resulting in forecasts that are no longer reliable.
Covariate Shift
Covariate shift occurs when only the distribution of the model's input features changes.
An economic downturn in a region can lead to a covariate shift, where the distribution of input features, P(X), changes. For example, rising unemployment rates or reduced consumer spending can alter P(X) through features like household income and spending habits. Even if P(Y|X), the relationship between these features and the target variable, remains stable, the model can still struggle because it was trained on a different distribution.
This shift might affect multiple stores within the region. But the impact won't be the same everywhere. Regionally, P(X) might change due to broad economic factors. Depending on the store's customer base or competition, the shift could look different at an individual store level. The variations in how P(X) shifts across levels make it challenging for a model to stay accurate.
How To Detect Covariate Shift?
Detecting covariate shifts involves comparing the distribution of monitored data against reference data over time.
What are monitored and reference data?
For most monitoring-related comparisons, we need two data distributions. One belongs to the reference period, which is set as the optimal or benchmark version. Against it, we compare the distribution from the analysis period. The data derived from the analysis period (also known as monitored data) represents the current state of production data.
Some forecasting models are built using large training and testing datasets, which give them a robust understanding of historical patterns. After deployment, however, the monitored data is often much smaller in volume, because it accumulates gradually as new time periods pass. This limited volume can make it difficult to get a comprehensive view of how well the model is performing. The best approach is to be patient and, in the meantime, use aggregation techniques to summarise over features and get a better understanding of your model.
When a single feature drifts over time, this is termed univariate drift; when the joint distribution of features drifts, it is termed multivariate drift. In practice, you will have to look for both. NannyML has developed an open-source Python library that brings all things post-deployment data science to your ML models with just one pip install.
It provides an elaborate set of distance metrics and statistical tests for identifying univariate drift. A widely popular method is the Jensen-Shannon distance, which calculates the dissimilarity between two data distributions. It treats shifts in either direction (an increase or decrease in demand) equally, making it versatile for multiple forecasting scenarios.
import nannyml as nml

# Compare the distribution of each monitored feature against the reference period
drift_calculator = nml.UnivariateDriftCalculator(
    column_names=columns_to_evaluate,
    timestamp_column_name='Order Date',
    continuous_methods=['jensen_shannon'],
    categorical_methods=['jensen_shannon'],
)

# Fit on the reference data, then score the monitored (production) data
drift_calculator.fit(reference_data)
results = drift_calculator.calculate(monitored_data)
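You can then visualise which features drifted and when. A minimal sketch, assuming the standard NannyML plotting API:

# Plot the Jensen-Shannon distance per feature over the analysis period
figure = results.filter(period='analysis').plot(kind='drift')
figure.show()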
For multivariate drift, we can use the PCA reconstruction method and Domain Classifier Algorithm.
In this blog, I implement and compare these multivariate drift detection algorithms with NannyML on real-world time series data.
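As a minimal sketch of the first option, here is how PCA reconstruction drift detection can be set up with the open-source library, reusing the column and timestamp names from the earlier example. The method compresses the features with PCA and tracks the reconstruction error; a rising error signals that the joint feature distribution has shifted.

import nannyml as nml

# Track multivariate drift via PCA reconstruction error
mv_calculator = nml.DataReconstructionDriftCalculator(
    column_names=columns_to_evaluate,
    timestamp_column_name='Order Date',
)
mv_calculator.fit(reference_data)
mv_results = mv_calculator.calculate(monitored_data)
mv_results.plot().show()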
How To Fix Covariate Shift?
The first instinct might be to retrain the model, but that's not always useful. Not every drift comes with a corresponding drop in performance.
Once you confirm that there is a significant performance drop, first experiment with adjusting the threshold values. Consider how the time of year or specific business cycles might affect your thresholds. If covariate drift is cyclical, adjust thresholds to be more lenient or strict during known periods of variability. This helps in managing seasonal effects without unnecessary retraining.
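As a rough sketch of what adjusting thresholds can look like in the open-source library, you can pass a custom threshold to the drift calculator. The 0.2 upper bound below is an arbitrary illustration, not a recommended value; tune it to your own seasonality.

import nannyml as nml

# Loosen the Jensen-Shannon alert threshold during a known high-variability period
lenient_threshold = nml.thresholds.ConstantThreshold(upper=0.2)

drift_calculator = nml.UnivariateDriftCalculator(
    column_names=columns_to_evaluate,
    timestamp_column_name='Order Date',
    continuous_methods=['jensen_shannon'],
    categorical_methods=['jensen_shannon'],
    thresholds={'jensen_shannon': lenient_threshold},
)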
Concept Drift
It occurs when P(Y∣X) changes, even if the input features' distribution P(X) remains stable. In other words, the model’s understanding of how input data maps to outcomes becomes outdated, and its learnings are no longer valid in the real world. During the pandemic, a forecasting model that was trained on pre-pandemic data may have experienced concept drift. Changes in consumer behavior, such as a sudden surge in online shopping and shifts in demand for certain products, would affect features like order frequency, delivery times, and purchase quantities.
How To Detect Concept Drift?
NannyML Cloud’s Reverse Concept Drift Algorithm detects concept drift (provided no covariate shift is present) and helps you quantify its impact on model performance.
It trains a new model with updated knowledge (a new concept, such as a shift from in-store to online shopping). Next, it applies this updated model to historical data based on previous purchasing patterns. RCD quantifies the impact of the concept drift by comparing the new model's performance on this historical data with that of the original model. Additionally, RCD estimates the magnitude of the drift, helping you understand and adjust your model accordingly.
How To Fix Concept Drift?
Retrain your models with newer concepts. Adopt a strategy that balances adding new information with retaining the old. Instead of retraining the entire model, first focus on the specific segments where the drift is most significant. Say the shift from offline to online shopping is more pronounced in certain regions: retrain the model using data from these regions first. You can also revisit the model's architecture and add new features to expand its capabilities.
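A minimal sketch of segment-first retraining, assuming a hypothetical 'region' column, a list of drifting regions coming out of your drift analysis, and one model per region:

# Retrain only the segments where concept drift is most pronounced
drifting_regions = ['North', 'West']                        # hypothetical output of drift analysis
recent = sales_df[sales_df['Order Date'] >= '2023-06-01']   # recent window containing the new concept

for region, region_df in recent[recent['region'].isin(drifting_regions)].groupby('region'):
    model = region_models[region]                           # hypothetical per-region models
    model.fit(region_df[feature_columns], region_df['Sales'])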
Complexity of Multi-Level Forecasting Models
In multi-level forecasting models, complexity arises from the need to accurately predict demand across different hierarchical levels, such as store, region, and category. A model trained on aggregated data may miss important details specific to individual stores, while a model trained on highly granular data may overfit local patterns and perform poorly at broader levels.
Data distributions and dynamics vary at each hierarchical level. For example, demand at a single store might be volatile due to local events or specific promotions. In contrast, regional demand might remain stable.
Performance issues at one level can be hidden by overall performance, making it hard to pinpoint and resolve problems.
How Do We Detect Issues in a Multi-Level Forecasting Model?
NannyML Cloud’s segmentation feature is particularly useful here. Segmentation helps pinpoint where the model's performance falters by chunking the data into meaningful portions. It allows you to incorporate more business context into your data science workflow during production without needing labels.
How Do We Fix Issues in a Multi-Level Forecasting Model?
Once you notice a particular segment drifting, you can do the following:
- Create Custom Metrics: Develop custom metrics at the segment level or business value metrics that reflect the true impact of the drift.
- Root Cause Analysis: Conduct a thorough root cause analysis for each segment to identify the factors contributing to the drift.
- Granular Insights: Get deeper insights into upstream processes affecting the segment, enabling more targeted interventions.
Remember, the strategies that apply to the entire dataset can also be adapted to specific segments for more precise model adjustments.
Delayed Ground Truth Data
After a prediction is made, ground truth data lets us evaluate its accuracy by comparing the predicted demand to the actual sales figures.
In a retail setting, sales data often arrives late. This lag can be due to the time it takes to process transactions, reconcile returns, and update inventory systems. Because the actual sales figures are not immediately available, there’s a delay in assessing how well the model’s forecasts align with reality. This postpones the ability to identify and correct our errors.
How Do We Evaluate the Performance of a Forecasting Model in the Absence of Ground Truth?
Since we can't calculate performance, we can estimate it using the Direct Loss Estimation (DLE) algorithm. DLE works by training an additional model, known as the "nanny model" (that's why we are known as NannyML 😃). The nanny model learns to estimate the loss, or error, associated with the original model's predictions, allowing you to estimate the model's performance even when the actual ground truth data is delayed.
After your demand forecasting model generates predictions, the nanny model uses the original features and the predictions from the main model to estimate the errors. This estimation process allows you to monitor and adjust your model's performance in real-time, even without immediate access to the actual sales figures.
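A minimal sketch of DLE with the open-source library, assuming 'y_pred' holds the forecasted demand and 'Sales' the (delayed) actuals; adjust the names to your own schema.

import nannyml as nml

# Estimate regression performance (MAE, MAPE) without waiting for actuals
estimator = nml.DLE(
    feature_column_names=columns_to_evaluate,
    y_pred='y_pred',
    y_true='Sales',
    timestamp_column_name='Order Date',
    metrics=['mae', 'mape'],
    chunk_period='M',  # monthly chunks
)
estimator.fit(reference_data)                  # reference data must include actuals
estimated_results = estimator.estimate(monitored_data)
estimated_results.plot().show()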
How Do We Fix Performance Issues for a Forecasting Model in the Absence of Ground Truth?
Estimating performance allows you to fix issues preemptively. If DLE suggests a possible decline in the model's ability to estimate inventory stocks, you can warn decision-makers not to rely entirely on the model.
Downstream Business Issues
Once a demand forecasting model is deployed, downstream business processes take over. Discrepancies between the model’s predictions and the execution of those predictions can cause serious issues. For instance, a model might forecast high demand for a product. However, the forecast's benefits are lost if the inventory management system does not stock the shelves as predicted or if the logistics team fails to deliver on time. This shows that accurate forecasting is not enough. The real success depends on translating these predictions into effective actions.
How Do We Detect It?
In a real-world scenario, forecasting errors can have different impacts depending on factors like product type, delivery time, or even regional demand variations.
This is where custom metrics come in handy. You can define personalised metrics that better align with the nuances of your business needs and monitor them to detect issues downstream.
Did I mention you can do all of this using Python?
For instance, you could create a weighted metric that gives more importance to high-demand products or specific regions where timely delivery is crucial. If delays or stock shortages in these areas result in significant business losses, monitoring a custom metric like this would alert you early on.
These custom features allow you to go beyond the standard metrics, giving you more actionable insights into how well your forecasts translate into operational success.
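As an illustration, here is a hypothetical weighted error metric in plain Python. The column names ('priority', 'y_true', 'y_pred'), the forecasts_df DataFrame, and the weights are assumptions for your own schema, not part of any NannyML API.

import numpy as np
import pandas as pd

def weighted_mape(df: pd.DataFrame, weights: dict) -> float:
    """Mean absolute percentage error, weighted by business priority."""
    w = df['priority'].map(weights).fillna(1.0)
    ape = np.abs(df['y_true'] - df['y_pred']) / np.clip(np.abs(df['y_true']), 1e-9, None)
    return float(np.average(ape, weights=w))

# Errors on high-priority SKUs count three times as much as low-priority ones
score = weighted_mape(forecasts_df, weights={'high': 3.0, 'medium': 1.5, 'low': 1.0})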
How Do We Fix It?
Once you’ve defined and monitored custom metrics for your demand forecasting model, the next step is using domain knowledge to fix any issues in downstream processes. These metrics provide a common language for teams across the business, from inventory management to logistics, helping to identify where bottlenecks or discrepancies occur.
Conclusion
In this blog, we went through possible setbacks your forecasting model can face during production and possible remedies for each.
You cannot forget about your models once deployed. What you can do is take proactive measures to set up an effective monitoring system.
To know more about tailored solutions right in your cloud platform, whether it's AWS or Azure, schedule a demo with our founders today.
Read More…
Check out these blogs that take a deep dive into monitoring similar use cases:
Frequently Asked Questions
What is drift in time series data?
Drift in time series data refers to gradual changes in the statistical properties of the data over time. It can affect model accuracy by causing predictions based on past data to become less reliable.
What is concept drift in forecasting?
Concept drift occurs when P(Y∣X) changes, even if the input features' distribution P(X) remains stable. For a forecasting model, this could be a pronounced shift in how a product is purchased, for example from offline to online.