Table of Contents
- What Is a Chunk?
- The Entire Chunking Process
- Why Random or Stratified Chunking Is Not Ideal
- The Importance of Chunking
- All Different Ways To Chunk
- Default
- Time Based Chunking
- Size Based Chunking
- Number Based Chunking
- Customize Chunking Behaviour
- Common Issues With Chunking
- In Case the Number of Chunks Is Too Low
- In Case the Number of Observations per Chunk Is Low
- Chunking at a Glance
- Continue Reading

Monitoring machine learning models in production involves dividing data into manageable units for analysis. At NannyML, we refer to these units as “chunks”. Other industry terms are “slices”, “segments”, “partitions”, or “batches”.
This blog explores the concept of chunks in ML monitoring, their importance, and the different methods to create them. We will discuss three main approaches: time-based, number-based, and size-based, along with the option to customize the process to fit unique requirements.
What Is a Chunk?
A chunk can be defined as a smaller, manageable subset of data that is created by dividing a larger dataset. It contains rows of data grouped together so they can be processed independently of other chunks.
In machine learning monitoring, a chunk is a subset of your model’s data over which an aggregated metric makes sense. This could include metrics like accuracy, mean feature values (e.g., average customer purchases), or other monitoring metrics.
For example, weekly or daily chunking is commonly used to monitor trends over time. Without this approach, metrics would be calculated over the entire dataset, making it harder to detect subtle trends, variations, or anomalies in the data.
NannyML provides a more structured and formalized method of chunking compared to traditional approaches. All results, whether related to drift, realized performance, or business metrics, are calculated and presented per chunk.
The Entire Chunking Process
Chunking begins with the reference dataset, which is divided into portions using a chosen method such as time, size, or count.
Metrics are then calculated for each chunk, and their results are processed to establish thresholds that serve as benchmarks.
The same chunking method is applied to the monitored dataset, where corresponding chunks are created, and metrics are calculated or estimated for each.
These chunk-level metrics are compared against the thresholds derived from the reference dataset. If any metric falls outside the defined range, an alert is raised.
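To make this concrete, here is a rough sketch of the threshold logic described above. It is an illustrative simplification, not NannyML's exact implementation, and the chunk-level accuracies are made-up numbers.
import numpy as np
# Hypothetical chunk-level accuracies: one value per reference chunk and per monitored chunk
reference_chunk_acc = np.array([0.91, 0.93, 0.92, 0.90, 0.94, 0.92])
monitored_chunk_acc = np.array([0.91, 0.89, 0.84, 0.92])
# A simple threshold rule: mean +/- 3 standard deviations of the reference chunk metrics
mean, std = reference_chunk_acc.mean(), reference_chunk_acc.std()
lower, upper = mean - 3 * std, mean + 3 * std
# Any monitored chunk falling outside the band raises an alert
alerts = (monitored_chunk_acc < lower) | (monitored_chunk_acc > upper)
print(alerts)  # [False False  True False]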

Why Random or Stratified Chunking Is Not Ideal
Over time, the data in production can diverge from the data a model was trained on; this phenomenon is known as data drift. Monitoring helps identify when the model’s performance is deteriorating due to this drift.
The goal is to identify any signs of degradation, such as a drop in accuracy or an increase in error rates, and this analysis is done chunk-wise.
If we randomly chunk the data, we might introduce noise that confuses the analysis of model performance over time. Since random chunking disregards the original order of the data, it can mix patterns, and that makes it complicated to determine whether performance issues stem from model degradation or simply from the way the data was sliced. In fact, random chunking can even create artificial drift. This would simply defeat the point of evaluating performance on the production data.
Similarly, stratified chunking would organize chunks based on specific characteristics like class distribution. It’s not ideal for post-deployment monitoring because it would again disrupt the natural order of data. You may end up with chunks that don’t represent the data as it evolves, making it difficult to see if the model is becoming less effective as time progresses.
Thus, the best approach would be to chunk it in a way that respects the original evolution of production data.
Curious how monitoring works in practice and what it takes to catch model drift before it impacts performance? See post-deployment monitoring in action and explore how it fits into your workflow with our founders.
The Importance of Chunking
Raw data can be overwhelming, especially when dealing with millions of rows.
Analyzing smaller, organized chunks eliminates unnecessary randomness that might otherwise distort your results. Chunk-level aggregation smooths out random variations, preventing false impressions of performance changes.
Chunks also make it easier to observe gradual drifts that could go unnoticed when viewing the dataset as a whole. By setting thresholds based on past patterns, chunking enables tracking model performance and raising alerts when something unusual happens. When performance issues arise, chunking helps you narrow down the problem to a specific data slice, simplifying root cause analysis.
Beyond mathematical advantages, chunking aligns monitoring results with business outcomes. It allows you to compare model performance against key business metrics. For example, by chunking daily, you can see if the revenue generated aligns with model expectations or if discrepancies suggest intervention is needed. This approach connects data insights to real-world operations, offering greater value to decision-making processes.
Different ML use cases and dataset features may call for tailored chunking methods. Let’s explore the various chunking strategies you can use with the NannyML OSS package.
All Different Ways To Chunk
NannyML is an open-source library for monitoring ML models in production. It uses chunking to compute relevant statistics about the ML model. You can bring post-deployment data science into your workflow with just one pip install.
$ pip install nannyml
import nannyml as nml
reference_df, monitored_df, _ = nml.load_synthetic_car_loan_dataset()
Default
Here is a code snippet that shows a generalised way to use CBPE for your model. To learn how to prepare your reference and monitored sets, check out this tutorial.
cbpe = nml.CBPE(
    y_pred_proba='y_pred_proba',
    y_pred='y_pred',
    y_true='repaid',
    timestamp_column_name='timestamp',
    # chunk_number=10 (default)
    metrics=['roc_auc'],
    problem_type='classification_binary',
)
cbpe.fit(reference_df)
est_perf = cbpe.estimate(monitored_df)  # the analysis set is now called the monitored set
figure = est_perf.plot(kind='performance')
display(figure)
If you don’t specify a chunking method, NannyML applies a default strategy: number-based chunking with a count of 10. This means the data is automatically divided into 10 equal-sized chunks.

Time Based Chunking
Time-based chunking splits data into chunks based on specified time intervals, such as hours, days, months, or years. This method groups observations within each interval and can result in chunks of varying sizes.
To use time-based chunking, set the chunk_period argument to your desired interval.
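For example, a monthly setup for the same CBPE estimator might look like this (a sketch reusing the setup from above; 'M' follows the pandas offset-alias notation accepted by chunk_period):
# Sketch: one chunk per calendar month via the chunk_period argument
cbpe_monthly = nml.CBPE(
    y_pred_proba='y_pred_proba',
    y_pred='y_pred',
    y_true='repaid',
    timestamp_column_name='timestamp',
    chunk_period='M',
    metrics=['roc_auc'],
    problem_type='classification_binary',
)
cbpe_monthly.fit(reference_df)
est_perf_monthly = cbpe_monthly.estimate(monitored_df)
display(est_perf_monthly.plot(kind='performance'))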
Time-based chunking is more intuitive for problems that naturally unfold over time, such as financial forecasting and churn prediction. Monitoring your models in this way helps you identify trends or seasonality, like a model that performs well in summer but struggles during the winter months.
However, be cautious when using time-based chunking in scenarios with irregular production data flows. If some chunks end up containing too few data points while others are large, this imbalance can introduce unnecessary statistical noise and reduce the reliability of your results.
Size Based Chunking
With this method, you fix the number of observations that comprise each chunk across the dataset, i.e., each chunk will have the same number of observations. Set this up by specifying the chunk_size parameter. It is useful when data arrives in bursts or is processed in batches of similar sizes, like manufacturing defect detection, where each batch corresponds to a set of produced goods.
If the data size per chunk is too small or inconsistent, then estimates can turn unreliable. We will discuss this scenario ahead in depth.
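A sketch of size-based chunking, again reusing the CBPE setup from above (the chunk size of 5,000 is an arbitrary example value):
# Sketch: every chunk holds exactly 5,000 observations via the chunk_size argument
cbpe_sized = nml.CBPE(
    y_pred_proba='y_pred_proba', y_pred='y_pred', y_true='repaid',
    timestamp_column_name='timestamp', chunk_size=5000,
    metrics=['roc_auc'], problem_type='classification_binary',
)
cbpe_sized.fit(reference_df)
est_perf_sized = cbpe_sized.estimate(monitored_df)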

Number Based Chunking
In this method, the entire dataset is divided into a specific number of chunks. The total number of chunks can be set by the chunk_number parameter. Chunks created this way will be equal in size. If the number of observations is not divisible by the chunk_number required, by default, the leftover observations will form their own, incomplete chunk. This is especially useful in cases where each chunk must represent an equal "weight" of the dataset, such as when monitoring model performance without regard to time or other groupings.
When the order of data matters, such as in temporal or sequential datasets, this method should not be your first choice. Splitting data purely by number might break meaningful sequences or groupings (e.g., splitting transactions across chunks that span multiple customers or time periods).
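In code, this only changes the chunk_number argument (a sketch based on the default example above; 20 is an arbitrary example value):
# Sketch: split the data into 20 equally sized chunks via the chunk_number argument
cbpe_counted = nml.CBPE(
    y_pred_proba='y_pred_proba', y_pred='y_pred', y_true='repaid',
    timestamp_column_name='timestamp', chunk_number=20,
    metrics=['roc_auc'], problem_type='classification_binary',
)
cbpe_counted.fit(reference_df)
est_perf_counted = cbpe_counted.estimate(monitored_df)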

Customize Chunking Behaviour
When your data or domain knowledge suggests unique patterns that standard chunking methods cannot capture, NannyML allows you to customize your chunking approach.
By creating and configuring a Chunker object with full control over its parameters, you can tailor chunk creation to your specific needs. You can then provide this custom Chunker to the calculator, making it easier to analyze data in a way that aligns with your objectives. While customization offers flexibility, avoid introducing unnecessary complexity or bias during chunk creation.
The behavior of incomplete chunks can be managed using the incomplete parameter. Below, we demonstrate how to define a custom Chunker, pass it to the calculator, and generate reproducible plots based on the results.
Appending observations into the last chunk.
from nannyml.chunk import CountBasedChunker
chunker_append = CountBasedChunker(chunk_number=9, incomplete='append')

Dropping the remaining data values to keep chunk sizes uniform.
from nannyml.chunk import SizeBasedChunker
chunker_drop = SizeBasedChunker(chunk_size=3500, incomplete='drop')
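Either chunker can then be handed to an estimator or calculator through its chunker argument. The snippet below is a sketch that assumes the CBPE setup from earlier:
# Sketch: passing a custom Chunker to CBPE via the chunker argument
cbpe_custom = nml.CBPE(
    y_pred_proba='y_pred_proba', y_pred='y_pred', y_true='repaid',
    timestamp_column_name='timestamp', chunker=chunker_drop,
    metrics=['roc_auc'], problem_type='classification_binary',
)
cbpe_custom.fit(reference_df)
est_perf_custom = cbpe_custom.estimate(monitored_df)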

Common Issues With Chunking
In Case the Number of Chunks Is Too Low
If you divide a dataset of 1000 rows into just 2 chunks, each chunk will contain 500 observations. With only two chunks, any subtle trends or variations will not be visible since data was grouped too broadly.

Gradual changes in the data could be hidden, and small fluctuations might not stand out. This lack of granularity makes it hard to detect meaningful shifts or anomalies, which could impact the reliability of your evaluation.
In Case the Number of Observations per Chunk Is Low
Smaller chunks can lead to unreliable results because they are more influenced by random noise than by actual signals in the data. A dataset divided into chunks of just 10 observations is more prone to outliers and random statistical fluctuations than one with, say, 300 observations. These small chunks can create a misleading impression of performance issues or anomalies.

As you can observe in the image, the increase in the confidence band for a specific chunk indicates lower certainty in the metric's estimation due to fewer observations in that chunk.
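A quick simulation makes the effect visible (purely illustrative, independent of NannyML): chunk-level accuracy computed over 10 observations fluctuates far more than the same metric computed over 300, even though the model's true accuracy never changes.
import numpy as np

rng = np.random.default_rng(42)
true_accuracy = 0.9  # the model's real accuracy never changes

def chunk_accuracies(chunk_size, n_chunks=50):
    # Simulate correct/incorrect predictions and compute accuracy per chunk
    outcomes = rng.random((n_chunks, chunk_size)) < true_accuracy
    return outcomes.mean(axis=1)

print(chunk_accuracies(10).std())   # large spread across chunks, roughly 0.09
print(chunk_accuracies(300).std())  # much smaller spread, roughly 0.02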
Chunks are not Segments.
In NannyML Cloud, chunking divides your data into ordered slices, called chunks, before the algorithm computes estimations. Each chunk represents a set of observations that we use to understand model performance over time.
Segmentation is a root cause analysis tool that splits your data into groups, called segments, based on unique categories in specific dataset columns. This allows you to analyze performance metrics separately for each segment after the estimations are drawn. It provides a granular understanding of how the model performs across different subsets of your data. Learn more about this feature.
Chunking at a Glance
Monitoring ML models in production requires dividing data into "chunks" for effective analysis. This blog explains time, size, and number-based chunking methods, along with customizable approaches.
| Chunk Method | Parameter | When to use it? | When not to use it? |
| --- | --- | --- | --- |
| Number Based | chunk_number | Need consistent chunk sizes for stable and comparable metrics. Useful when monitoring model performance without regard to time or other groupings. | Splitting data purely by number obscures trends or breaks meaningful sequences. |
| Size Based | chunk_size | Data arrives in bursts or is processed in batches of similar sizes. Useful in domains like manufacturing where batch sizes are fixed. | The data size per chunk is too small, leading to unreliable estimates. Leftover observations cannot be easily handled or are too many. |
| Time Based | chunk_period | Data naturally unfolds over time (e.g., financial forecasting, churn prediction). Need to analyze trends, seasonality, or performance over regular intervals. | Production data arrives irregularly, leading to highly variable chunk sizes. Incomplete periods result in chunks that are too small to analyze effectively. |
| Custom | Custom Chunker instance, e.g. modifying the incomplete parameter | Standard chunking methods do not align with specific data patterns or domain knowledge. Unique requirements like overlapping chunks or hybrid chunking strategies. | Customization adds unnecessary complexity. It introduces biases or is not well-justified for the problem being solved. |
Continue Reading
Loved NannyML OSS? Don't forget to star it.
Read this tutorial to learn how to monitor an ML model end to end in Google Colab.