Table of Contents
- Keeping Your Model in Shape: The Impact of Data Drift and Quality Issues
- ML Monitoring Workflow with NannyML
- First off, Performance Monitoring
- Then, Root Cause Analysis (RCA)
- Last, but not least, Issue Resolution
- Hands-on Tutorial: Using NannyML to Monitor a Sales Forecasting Model
- 1. Getting the data
- 2. Building the Model
- 3. Deploying the Model
- 4. Monitoring Workflow
- Setting Up
- Estimating the Model Performance with DLE
- Detecting Multivariate Drift with the Drift Calculator
- Univariate Drift Detection
- One step further: root cause analysis
- Realized Performance vs. Estimated Performance
- Conclusion
- References
In machine learning, data scientists devote significant effort to developing and training models, but what happens when those models encounter the unpredictability of real-world data?
Picture this: your retail chain’s sales forecasting model, previously a reliable asset, begins to perform poorly just before the Christmas season. Popular items sell out, excess stock piles up, and customer dissatisfaction soars. The villain? A subtle shift in customer behavior that the model was not trained to handle.
And who is to save us? Model monitoring. By continuously tracking model performance and detecting drift early, businesses can adapt models to changing conditions, ensuring they remain reliable and effective.
In this blog, we will explore how monitoring tools, like NannyML, empower organizations to safeguard their models, avoid costly disruptions, and thrive in a dynamic environment.
Machine learning was revolutionary for sales forecasting due to its ability to analyze extensive historical data to uncover patterns and trends, enabling businesses to make accurate predictions and informed decisions. By improving forecasting accuracy, ML facilitates strategic advantages like efficient inventory management, cost reduction, and enhanced revenue planning. As a result, businesses gain deeper insights into customer behavior, market dynamics, and seasonal trends, driving smarter decisions and sustainable growth.
However, unlike traditional software, machine learning models are dynamic and depend heavily on training data. Once a model is deployed, it interacts with real-world data that may evolve in ways the model was not prepared for. So, how can businesses ensure their models remain reliable? Monitoring. Monitoring for model performance ensures that any drop in accuracy or efficiency is caught early, allowing teams to address underlying issues before they escalate into costly failures.
Our scenario above highlights the critical importance of MLOps (Machine Learning Operations), a practice that manages the lifecycle of machine learning models, from development and deployment to continuous monitoring. One of the most crucial components of MLOps is ML monitoring, which ensures that models continue to perform well in production, even as new ground truth data emerges. Without proper monitoring, even the best models can suffer from data drift — a phenomenon where input data changes over time, or the relationship between input and target variables shifts — leading to model performance degradation.
By implementing performance monitoring and automated drift detection, organizations can maintain the reliability of their machine learning models, ensuring they adapt to the dynamic nature of real-world data and continue to provide valuable insights to their stakeholders.
Having said that, it is important to mention that Data Scientists have a crucial role in ML monitoring. They are the ones best equipped to address post-deployment issues with ML models because they understand both the model's technical intricacies and the business context. While ML engineers know the technology and business stakeholders grasp goals, only data scientists bridge these areas effectively to resolve model drift and adapt to business changes.
Keeping Your Model in Shape: The Impact of Data Drift and Quality Issues
Our sales forecasting scenario highlights why monitoring machine learning models post-deployment is essential. Models trained on historical data can become outdated as customer behavior and market conditions shift, leading to inaccurate predictions and business consequences. Monitoring is particularly crucial in use cases like sales forecasting, where precise predictions drive inventory planning, staffing, and profitability.
Traditionally, companies have relied on realized performance metrics and data drift detection. This approach focuses on comparing predictions to labeled data, often delayed, to identify drops in test performance — indicating model degradation. However, without continuous monitoring, unseen data drift or data quality issues may remain undetected.
Now, data drift poses a significant challenge.
For example, changes in customer purchasing patterns, new promotions, or seasonal shifts can make models less effective. Other issues, like corrupted pipelines or concept drift, where the relationships between inputs and outputs shift, further complicate monitoring. If not caught early, these problems can severely impact business outcomes.
On the other hand, while statistical data drift methods are helpful, they can be overly sensitive, requiring careful configuration to avoid irrelevant alerts. Drift detection should complement, not dominate, monitoring strategies. Not every drift affects performance, especially when involving less critical features.
In the following sections, we’ll explore how to go beyond traditional methods and focus on what truly impacts your model's success.
ML Monitoring Workflow with NannyML
NannyML is an open-source Python library for post-deployment machine learning model monitoring. It simplifies the monitoring of machine learning models in production by providing comprehensive tools to track model performance and detect data drift in real time.
Its monitoring workflow revolves around three key steps:
- Performance monitoring
- Root cause analysis
- Issue resolution
which together ensure that models continue to perform optimally even when production data shifts over time. Have a look at the diagram below.
First off, Performance Monitoring
Why does performance monitoring come first? Simple. Assessing model performance and its business impact is central to this approach, as ultimately, achieving strong, consistent performance is what truly matters.
One of the standout features of NannyML monitoring workflow is its ability to estimate model performance even when actual outcomes are delayed or unavailable, a common issue in production environments. For example, in sales forecasting, it is not always possible to instantly validate predicted sales against actual outcomes, especially during long promotional periods.
The Direct Loss Estimation (DLE) algorithm, for example, is used to estimate the performance for regression models. It can estimate how closely the model’s predictions align with the expected sales for instance, without needing immediate access to real-world sales data. This allows businesses to identify performance drops early, before they become costly.
Then, Root Cause Analysis (RCA)
If performance degradation is actually found, data drift analysis can be used to explore possible causes of the issue rather than serving as a primary alert system. This approach helps eliminate unnecessary alerts, allowing the team to concentrate on significant issues that genuinely affect model performance.
In truth, a critical aspect of this workflow is its ability to detect data drift, both univariate — changes in a single variable over time — and multivariate — relationships and interactions between multiple features at the same time.
In the context of sales forecasting, data drift can occur when the input data changes subtly over time, affecting the model’s accuracy. For instance, if the distribution of customer traffic or sales per store begins to shift due to changing market conditions or new promotions, the multivariate drift detection can flag these shifts early by comparing the current data with the reference data the model was originally trained on. This allows teams to catch issues like covariate shift — where the model’s inputs change, but the relationship with sales remains the same — before it negatively impacts the model’s forecasts.
In this way, the main goal of this monitoring workflow is to help identify root causes by allowing teams to dive deeper into specific features that may be contributing to performance degradation. For example, in sales forecasting, if the model suddenly begins to underperform during a particular promotional period, NannyML can help identify whether it is due to a shift in customer traffic, changes in store promotions, or other variables such as holidays.
Last, but not least, Issue Resolution
After identifying the root cause of a performance issue, selecting the right solution depends on context and severity. This granular level of analysis shown above is crucial for understanding why the model is struggling and what actions can be taken to correct it — whether it’s retraining the model with updated data or adjusting feature importance based on recent trends.
Now that we've discussed the importance of monitoring, let's put theory into practice. In the hands-on tutorial below, you will learn step-by-step how to use NannyML OSS to track the performance of a sales forecasting model, detect data drift, and perform root cause analysis.
Hands-on Tutorial: Using NannyML to Monitor a Sales Forecasting Model
Let’s dive into our hypothetical scenario.
Imagine you’re part of the Data Science team at a large retail chain. Your team has developed a machine learning model to forecast daily sales using features like promotional activity, day of the week, and lag-based sales trends. The Gradient Boosting model you’ve implemented performs well during testing, achieving an acceptable RMSE and a good R² score. Initial production results look promising, with forecasts aligning closely with actual sales.
However, as the holiday season approaches, the model's performance begins to degrade. Shifts in customer behavior and changing sales patterns lead to less accurate predictions, resulting in inventory mismanagement and operational disruptions during the busiest shopping period. This scenario highlights the critical need for robust monitoring to detect and address issues like data drift and model degradation in real time.
In this tutorial, we will use a rich sales dataset to demonstrate how to build a reliable forecasting model and monitor it effectively.
To demonstrate model monitoring in practice, we will:
- Preprocess the Sales Dataset: Filter out closed days, and engineer features for seasonality and trends.
- Train a Regression Model: Build a sales forecasting model using training data from January 2013 to June 2014.
- Simulate Production Data: Evaluate the model on test data from July 2014 to March 2015 and production data from April 2015 to July 2015.
- Monitor: Use NannyML OSS to detect performance issues, identify data drift, and analyze potential causes of model degradation.
Let’s get started.
1. Getting the data
Sales forecasting involves analyzing time-series data, which can be, in most cases, quite tricky. If time-series data is not handled correctly, you might end up with a model that seems fine during training but fails when used in real-world situations.
What happens is that time-series data changes over time, with patterns like trends, seasonality, and stationarity that need to be considered.
Therefore, it is important to test and monitor models thoroughly, so we avoid problems like data leakage, where the model gains access to future information during training, resulting in artificially inflated performance metrics or biased model results.
So, after performing some Exploratory Data Analysis, we can visualize some information about our sales data. Here’s a summary of some initial findings:
- Sales Distribution: Sales values exhibit a long-tail distribution, with a significant number of outliers beyond 13,550 units. These high-sales days often correspond to promotions or holiday periods.
- Customer Distribution: The number of customers per day follows a similar trend, with extreme values occasionally exceeding 1,460.
- Closed Days: Approximately 17% of records correspond to stores being closed, resulting in zero sales. These entries were excluded for modeling purposes.
- Seasonality: Sales and customer patterns vary based on the day of the week, month, and quarter, highlighting the importance of including these temporal features in the model.
- Promotions: Promotional activities significantly influence sales, with spikes observed during promotional periods.
To account for the seasonality in the data and improve the correlation matrix, we engineered the following features:
- sales_lag_1: Captures short-term daily seasonality by including the sales value from the previous day.
- avg_sales_month: Tracks longer-term trends by calculating the average sales for the same month across different years.
- avg_sales_day_week: Reflects shorter-term patterns by averaging sales for the same day of the week over successive weeks.
All features were calculated by grouping the data by store_id, ensuring the seasonality adjustments are tailored to each store's unique sales patterns. So, our final dataset and its correlations look like the following:
df_model[['date','sales', 'promotion', 'week', 'day_of_week',
'sales_lag_1', 'avg_sales_month', 'avg_sales_day_week']]
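For reference, features like these could be derived with a simple pandas groupby; here is a minimal sketch, assuming a dataframe df with store_id, a datetime date column, and sales (the exact preprocessing in our pipeline may differ):
# Minimal sketch of the feature engineering (assumed column names: store_id, date, sales)
df = df.sort_values(['store_id', 'date'])
# Previous day's sales per store (short-term daily seasonality)
df['sales_lag_1'] = df.groupby('store_id')['sales'].shift(1)
# Average sales per store and calendar month (longer-term monthly trend)
df['avg_sales_month'] = df.groupby(['store_id', df['date'].dt.month])['sales'].transform('mean')
# Average sales per store and day of week (weekly pattern)
df['avg_sales_day_week'] = df.groupby(['store_id', df['date'].dt.dayofweek])['sales'].transform('mean')
Note that in a real pipeline such averages should be computed from past data only (for example with an expanding window) to avoid the data leakage discussed above.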
2. Building the Model
Firstly, it's crucial to preserve the chronological sequence of observations, as disregarding it can result in substandard model performance or, even more concerning, completely inaccurate predictions.
So, before anything else: when training an ML model, we often split the data into two sets (train and test). But since the goal of this tutorial is to learn how to monitor the performance of our ML model's predictions on unseen “production” data, we will actually split the original data into three sets, using a chronological split so that the model never has access to future information during training (a sketch of this split follows the list below):
- Train set: Data from 01-01-2013 to 30-06-2014. Training size: 317,373 rows.
- Test set: Data from 01-07-2014 to 31-03-2015. Testing size: 144,412 rows.
- Prod set: Data from 01-04-2015 to 31-07-2015. Production size: 70,141 rows.
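A minimal sketch of this chronological split, assuming df_model has a datetime date column and the boundaries listed above:
# Chronological split: no shuffling, so the model never sees future rows
train_set = df_model[df_model['date'] <= '2014-06-30']
test_set = df_model[(df_model['date'] > '2014-06-30') & (df_model['date'] <= '2015-03-31')]
prod_set = df_model[df_model['date'] > '2015-03-31']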
We started testing with a Decision Tree and then moved to a GradientBoostingRegressor, tuned with RandomizedSearchCV using TimeSeriesSplit for cross-validation, because this setup respects the temporal nature of the data, prevents data leakage, and better simulates real-world forecasting scenarios; a sketch of this setup is shown below.
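As a rough illustration of that setup (the parameter grid and number of iterations below are assumptions, not the exact values used here):
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
param_distributions = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7]}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    scoring='neg_root_mean_squared_error',
    cv=TimeSeriesSplit(n_splits=5),  # folds respect temporal order, no leakage
    random_state=42)
search.fit(X_train, y_train)  # features and target taken from the train set above
model = search.best_estimator_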
Once the model was refined and trained, we made predictions on the train and test data to evaluate it. And what we got was:
- R² value of 0.88,
- Root Mean Squared Error of 1061.4 (currency units) on the test set — which we can consider quite decent since it considerably outperforms the rmse_test_baseline of 3116.8.
Below we plotted two scatter plots that compare the RMSE of the model's predictions to the baseline RMSE (where every prediction is equal to the mean of the training data) for both train and test data.
As we can see, this visualization confirms that the model generalizes well to new data, making it suitable for production use. However, there are some predictions that deviate more significantly from the actual values in the test set, possibly due to unseen patterns or noise in the test data — which highlights the importance of monitoring for potential drifts in data when deployed.
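For clarity, the rmse_test_baseline mentioned above corresponds to predicting the training mean for every test row; a quick sketch of that computation, assuming y_train and y_test arrays:
import numpy as np
# Naive baseline: every prediction equals the mean of the training target
baseline_pred = np.full(len(y_test), np.mean(y_train))
rmse_test_baseline = np.sqrt(np.mean((np.asarray(y_test) - baseline_pred) ** 2))
print(round(rmse_test_baseline, 1))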
3. Deploying the Model
In order to simulate a production environment, we will use the trained model to make predictions on unseen production data — the ‘prod set’ as determined above. We will then use NannyML to check how well the model performs on this data.
We performed exactly the same cleaning, preprocessing, and feature engineering on this part of the dataset so that our model could be used to make new predictions.
Now, let’s go over some important steps in our monitoring workflow.
To analyze our machine learning model performance in production using NannyML, two datasets are required: a reference dataset and an analysis dataset.
- The reference dataset serves as a baseline, containing data where the model performed as expected, and it includes known targets and predictions. This is usually the test set, which the model did not see during training. As defined above, it covers data from 01-07-2014 to 31-03-2015.
- The analysis dataset consists of recent production data — our prod set, covering data from 01-04-2015 to 31-07-2015 — and doesn't require targets. It is used to monitor performance and detect data drift based on insights from the reference dataset.
# Reference: contains true labels (y_test) and predictions (y_pred_test)
reference = test_set.copy()
# Analysis: contains predicted labels (y_pred_prod)
analysis = prod_set.copy()
# Both need to contain the variables 'y_true' and 'y_pred' under the same name
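A sketch of how those columns could be populated, assuming the fitted model from above and hypothetical feature matrices X_test / X_prod built with the same feature list used in training:
# Reference: test-period data with both true values and predictions
reference['y_true'] = y_test.values
reference['y_pred'] = model.predict(X_test)
# Analysis: production-period data with predictions only (targets arrive later)
analysis['y_pred'] = model.predict(X_prod)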
4. Monitoring Workflow
Now is the moment when we have to answer our three main questions:
- Is the model performing well?
- If not, what went wrong?
- How can we fix it?
Setting Up
Let’s start by setting up our library. Below is an example of how to install and integrate NannyML OSS with the existing sales forecasting pipeline.
%pip install nannyml
import nannyml as nml
Estimating the Model Performance with DLE
As mentioned before, when a machine learning model is in production, it is important to monitor its performance, but actual performance cannot always be measured right away due to delays in obtaining target values (e.g., amounts of daily sales).
To address this, the Direct Loss Estimation algorithm can be used to estimate performance without waiting for the actual targets. DLE trains an additional model to estimate the loss function of the monitored model, providing an estimate of its performance. This method is useful for regression tasks, like the one in this tutorial.
features = ['promotion','sales_lag_1', 'avg_sales_month', 'avg_sales_day_week']
dle = nml.DLE(
y_pred='y_pred', # The column with your model's predictions
y_true='y_true', # The column with the true values
feature_column_names=features,
timestamp_column_name='date',
chunk_period='m', # Performance on a monthly basis ('w' for weekly basis)
metrics=['rmse']) # The performance metrics you want to track
# Fit DLE on the reference (test) dataset where true labels are available
dle.fit(reference)
estimated_performance_rmse = dle.estimate(analysis)
estimated_performance_rmse.plot()
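The weekly view discussed below comes from the same estimator chunked by week instead of month; a sketch:
dle_weekly = nml.DLE(
y_pred='y_pred',
y_true='y_true',
feature_column_names=features,
timestamp_column_name='date',
chunk_period='w', # weekly chunks
metrics=['rmse'])
dle_weekly.fit(reference)
dle_weekly.estimate(analysis).plot().show()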
These two graphs represent the same metric - RMSE - but at different levels of granularity: monthly in the first graph and weekly in the second.
The model demonstrates consistent performance over time, with one notable spike in errors around December 2014, likely driven by disruptions in customer behavior, such as increased sales variability during the holiday season. Despite this spike, the monthly estimated performance remains within acceptable thresholds, indicating that the model operates effectively during both the reference and analysis periods.
While both graphs highlight the holiday-related performance issue, the weekly graph provides more detailed insights, pinpointing a sharp RMSE spike during the week of 22-28 December 2014 (as seen in the drift alert). This granularity allows for better identification of short-term issues, making the weekly view ideal for operational monitoring and troubleshooting, while the monthly graph excels at providing a cleaner, high-level summary for strategic planning.
Detecting Multivariate Drift with the Drift Calculator
Multivariate data drift detection provides an overview of changes across all features in a dataset, detecting shifts in the general feature distribution rather than examining each feature independently. This method captures subtle changes in data structure, such as shifts in relationships between features, which univariate methods might miss.
To detect drift, we will use the DataReconstructionDriftCalculator method. It applies PCA (Principal Component Analysis) — an ML method that breaks down variables into a subset of linearly independent principal components — to compress the reference data into a latent space, then decompresses it to calculate reconstruction error. If the reconstruction error for production data exceeds a threshold, it indicates data drift, meaning the production data structure no longer matches the reference data.
drdc = nml.DataReconstructionDriftCalculator(
column_names=features,
timestamp_column_name='date',
chunk_period='m') # Drift on a monthly basis ('w' for weekly basis)
drdc.fit(reference)
# Estimate drift on the production dataset
drift_results = drdc.calculate(analysis)
# Plot the drift results
drift_results.plot()
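If you want to inspect the raw numbers behind these plots rather than the figures, the results object can also be exported as a dataframe:
# Reconstruction error, thresholds and alert flags per chunk
drift_results_df = drift_results.to_df()
display(drift_results_df.head())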
These graphs show the PCA Reconstruction Error on a monthly and weekly basis, illustrating how the metric evolves over time and detecting potential shifts in data distribution. The PCA algorithm measures how well the current data aligns with the reference data. A low error indicates consistency in the data structure, while a high error signals a potential shift.
Drift Alert: There are two points where the reconstruction error exceeded the upper threshold, indicating significant data drift.
Here’s how the two graphs compare:
In December 2014, the monthly graph highlights a clear spike in reconstruction error, signaling a major data distribution shift. This drift is likely driven by holiday-related changes in customer behavior and sales patterns. The weekly graph, however, provides a more detailed view, pinpointing the drift to the specific weeks of 15-21 December and 22-28 December. This finer granularity allows for a closer examination of the timing and magnitude of the drift.
Another point to consider is that for the week of 30 March to 5 April, 2015, there’s a noticeable spike in the PCA Reconstruction Error as well, indicating another significant shift in data distribution. This spike suggests that the underlying structure of the data during this period deviated from the reference set, potentially due to unexpected changes in customer behavior, sales trends, or operational factors such as promotions or holidays leading up to Easter, for example. If promotions or marketing campaigns were atypical compared to previous years, they might have introduced patterns the model was not trained to handle, triggering the drift.
These insights highlight the importance of accounting for event-driven seasonality when analyzing data, building models, and monitoring them. Monitoring sharp changes like these can help businesses adjust their strategies and recalibrate models to better account for short-term anomalies, ensuring more accurate predictions and stable performance during critical weeks.
Univariate Drift Detection
Univariate drift detection enables a more detailed analysis by examining each feature separately. It isolates significant changes in individual features over time, complementing the multivariate drift detection results. By focusing on specific features, we gain deeper insights into the causes and timing of the data drift.
udc = nml.UnivariateDriftCalculator(
column_names=features,
timestamp_column_name='date',
chunk_period='m') # Drift on a monthly basis ('w' for weekly basis)
udc.fit(reference)
univariate_data_drift = udc.calculate(analysis)
for feature in features:
display(univariate_data_drift.filter(period='all', metrics='jensen_shannon',
column_names=[feature]).plot(kind='distribution'))
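Besides the distribution plots above, the same results can be plotted as the Jensen-Shannon drift metric over time for all monitored features, which makes the alert periods easier to spot:
univariate_data_drift.filter(
period='all', metrics='jensen_shannon', column_names=features
).plot(kind='drift').show()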
Again, we spotted some red alerts in our model’s dataset. Let’s interpret them:
For the feature avg_sales_day_week, which reflects shorter-term weekly sales patterns, drift alerts were observed in both the monthly and weekly graphs. In the monthly view, alerts appeared during June and July 2015, likely driven by seasonal factors. The weekly view provides a more granular breakdown, identifying specific drift periods from March 30 to April 5, 2015, and from July 27 to August 2, 2015. These anomalies indicate shifts in weekly sales behavior, possibly tied to events such as Easter and mid-year promotions.
The feature sales_lag_1, which captures short-term daily seasonality, flagged a drift alert during the week of March 30 to April 5, 2015 as well. This aligns with anomalies detected in the avg_sales_day_week feature and in the multivariate analysis, further underscoring the significance of this period.
These insights highlight the importance of univariate drift detection in identifying specific features and periods of instability. For instance, the recurring drift during late March and early April emphasizes the need to account for event-driven seasonality, such as Easter, in predictive models. Similarly, the mid-year drift in avg_sales_day_week suggests adjustments are necessary to accommodate seasonal trends in weekly sales patterns.
One step further: root cause analysis
The graphs below provide complementary insights into the anomalies and challenges of estimating performance in April 2015, focusing on the relationship between promotional activity and sales trends over three years (2013, 2014, and 2015).
- Monthly Promotion Count:
- The first graph shows that April 2015 had the lowest count of promotions compared to the same month in 2013 and 2014. This would typically suggest a decline in sales activity, as promotions are a primary driver of customer engagement.
- Sum of Sales:
- Contradicting the expectation set by the low promotion count, the second graph reveals that April 2015 had the highest total sales among the three years. This unexpected spike indicates that factors other than promotions likely drove the sales increase, such as organic demand, external events, or market trends.
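The two views above could be reproduced with a simple aggregation over the historical data; a sketch, assuming the full history sits in a dataframe df with date, promotion, and sales columns:
# Compare April across years: promotion days vs. total sales
april = df[df['date'].dt.month == 4]
april_summary = april.groupby(april['date'].dt.year).agg(
promotion_count=('promotion', 'sum'),
total_sales=('sales', 'sum'))
print(april_summary)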
This mismatch between promotion count and sales performance likely caused confusion for the model, as it disrupted the typical relationship the model had learned from historical data. Specifically:
- The model may have relied heavily on promotions as a key feature for predicting sales. The sudden decoupling of promotions and sales in April 2015 introduced a data drift, where the patterns in the training data no longer aligned with real-world behavior.
- The unexpectedly high sales despite reduced promotions would have appeared as an anomaly, making it challenging for the model to generalize accurately to this scenario.
To sum up, these graphs highlight the importance of identifying and addressing non-linear relationships or unexpected data shifts in predictive modeling. To better handle such scenarios in the future, the model could benefit from:
- Incorporating additional features (e.g., external factors like holidays or market trends) to capture variables that influence sales beyond promotions.
- Regularly monitoring feature importance and potential data drift to detect and adapt to changes in feature behavior over time.
Realized Performance vs. Estimated Performance
Now comes another very useful feature of the NannyML monitoring system. Once the target values are available (we have actual daily sales values), we can assess the actual performance of the model on the production data, known as realized performance.
In the following step, we calculate the realized performance and compare it with NannyML's estimated performance.
# assign true values to 'analysis'
analysis['y_true'] = prod_set['y_prod']
perfc = nml.PerformanceCalculator(
metrics=['rmse'],
y_true='y_true',
y_pred='y_pred',
problem_type='regression',
timestamp_column_name='date',
chunk_period='m') # Performance on a monthly basis ('w' for weekly basis)
# Fit the performance calculator on the reference data
perfc.fit(reference)
# Calculate realized performance on production data
realized_performance_rmse = perfc.calculate(analysis)
# plot estimated vs realized performance comparison
estimated_performance_rmse.compare(realized_performance_rmse).plot().show()
These two graphs, showing the estimated RMSE versus the realized RMSE, provide complementary insights into how well the DLE approach can predict model performance over time. The monthly view aggregates broader trends, while the weekly view captures more detailed fluctuations.
Here’s what we can conclude:
The monthly graph highlights overall trends in estimated versus realized RMSE, showing that DLE performs well under stable conditions, with estimated RMSE closely aligning with realized RMSE. However, it underestimates significant deviations, such as the holiday-related spike in December 2014 and the anomaly in April 2015.
The weekly graph provides a more detailed view, pinpointing short-term fluctuations. It reveals the exact weeks of significant drift, such as December 22-28, 2014, and March 30-April 5, 2015, which are smoothed out in the monthly view. These short-term spikes highlight the limitations of DLE in handling unseen covariate shifts, when the distribution of variables in the real-world is markedly different from the training data.
These two perspectives demonstrate the granularity trade-off again: the monthly view is ideal for tracking major trends and ensuring overall stability, while the detailed weekly view uncovers precise anomalies and operational insights. Together, they offer a comprehensive understanding of model performance, enabling both strategic oversight and tactical troubleshooting.
Conclusion
In this article, we explore how machine learning models, especially those used in sales forecasting, can fail without proper monitoring post-deployment. Using a sales forecasting model as an example, we illustrate how tools like NannyML help businesses track performance, detect data drift, and prevent model degradation over time. With real-world applications, such as preparing for high-demand periods like holidays or promotions, this article emphasizes the importance of continuous monitoring to ensure accurate and reliable predictions.
Ready to safeguard your machine learning models?
Explore NannyML’s GitHub or check out the documentation to start integrating advanced monitoring today!