How to Estimate Performance and Detect Drifting Images for a Computer Vision Model?

One of the unfortunate properties of computer vision models is that their performance deteriorates over time, leading to less reliable results. These models are trained on static images, so when they are deployed in production environments with constantly changing data, the patterns they've learned become outdated.
Think about a road sign detection model in a car. If the country decides to replace older road signs with newly designed ones, the model will have difficulty identifying them. As a result, the driver could get inaccurate speed limit information on the dashboard, potentially leading to an accident or, at best, a speeding ticket. To prevent such a failure, the model needs a monitoring system that can detect the issue and explain why it is happening.
In this blog, we will create a system to monitor both the performance and data shifts of our satellite image classification model.
Let’s get into it!

Monitoring performance first

Monitoring system workflow.
The key initial step of a monitoring system in production is tracking performance metrics like accuracy. When performance drops, it means there's an issue in the data that must be analyzed and resolved.
In some computer vision applications, such as quality inspection of car manufacturing parts, evaluating performance is relatively straightforward. If the model predicts good quality, the part is manually assembled into the car; if the model is uncertain, the part is sent to an operator, and we get immediate feedback on the model's performance. However, in many real-world applications, labels arrive with a delay or are absent entirely. For example, consider a garbage sorting system: the computer vision model identifies the waste items, and a robot arm then picks and sorts them. Evaluating whether the model is correct requires a person to label the video or inspect the sorted waste. Both solutions are expensive and not sustainable in production. How can we know if the model is still working well?
We can estimate the model's performance using the CBPE (Confidence-Based Performance Estimation) algorithm.
CBPE workflow.
It might look a bit complex, but here is how it works. First, CBPE breaks down predictions into confidence scores. Take the first prediction of 0.99 as an example: we expect the model to classify 99% of such observations correctly, so 0.99 is assigned to the true positives. The model will get the remaining 1% wrong, so 0.01 is assigned to the false positives. Next, using the confidence scores of all observations, CBPE constructs an estimated confusion matrix. From this matrix, it calculates any desired classification metric without the need for labels.
If you're interested in how it works in detail and want a deeper explanation, check out this article. The key insight to remember is that CBPE only requires the model's predictions to estimate performance. The algorithm is available in the NannyML open-source library, and we will see how to implement it in the practical section.
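To make the intuition concrete, here is a toy, hand-rolled sketch for a binary model. It is not NannyML's implementation; it only shows how confidence scores alone are enough to fill in expected confusion matrix entries and, from them, estimate a metric such as precision.

import numpy as np

# Toy illustration of the CBPE idea for a binary model (not NannyML's implementation):
# each prediction labelled positive contributes its confidence to the expected true
# positives and the remainder to the expected false positives.
confidence_scores = np.array([0.99, 0.90, 0.72, 0.65])  # predictions labelled positive

expected_tp = confidence_scores.sum()         # 0.99 + 0.90 + 0.72 + 0.65 = 3.26
expected_fp = (1 - confidence_scores).sum()   # 0.01 + 0.10 + 0.28 + 0.35 = 0.74
estimated_precision = expected_tp / (expected_tp + expected_fp)  # 3.26 / 4 ≈ 0.82
print(estimated_precision)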
Now that we know how to catch a performance drop in real time, even without labels, the next step is to understand why it is happening.

Understanding why performance drops

When the estimated or realized performance drops in production, it means there is an issue in the data causing it. The most common problem is covariate shift, which is a change in the joint probability distribution of the input variables.
Covariate shift usually happens due to major changes in the environment. For example, in the previously mentioned road sign detection problem, the country replaced the old signs with new ones. There are many methods to detect the shift, but for this blog, we'll use another invention from NannyML's lab, an open-source multivariate drift detection algorithm.
The multivariate drift detection method.
The method allows us to look at changes across all input features. First, the original data is compressed to a latent space using PCA, which extracts the internal data structure. Then, using the inverse transform, the data is brought back to its initial shape, with certain differences.
These differences are quantified using the Euclidean distance between the original and reconstructed data, known as the reconstruction error.
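Here is a minimal sketch of that idea on synthetic tabular data, using scikit-learn's PCA; NannyML's DataReconstructionDriftCalculator adds scaling, chunking, and alert thresholds on top of this.

import numpy as np
from sklearn.decomposition import PCA

# Minimal sketch of the reconstruction-error idea on synthetic tabular data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                   # 500 rows, 20 features

pca = PCA(n_components=5).fit(X)
latent = pca.transform(X)                        # compress to the latent space
reconstructed = pca.inverse_transform(latent)    # bring it back to the original shape
reconstruction_error = np.linalg.norm(X - reconstructed, axis=1).mean()
print(reconstruction_error)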
The reconstruction error plot over time in NannyML.
As you can see, the reconstruction error is a single value monitored over time. If it increases beyond the given thresholds, it indicates that a covariate shift is present in our data. The method is available in the NannyML library, but it only works with two-dimensional (tabular) data. To reduce the images to a two-dimensional shape, we can use MobileNetV2, a pre-trained convolutional neural network available in popular deep learning frameworks such as PyTorch, JAX, and TensorFlow.
The process of converting an image into a feature vector which is used for multivariate drift detection.
The first step is to pass an image through MobileNetV2, which outputs a 1000-value feature vector. The feature vector is then fed into the multivariate drift detection algorithm and analyzed by NannyML. To detect the covariate shift, we again monitor the reconstruction error over time, but this time it is calculated between the feature vector and the reconstructed feature vector. That's all the theoretical background we needed to cover; now we can jump into our practical use case.
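Before we do, here is a minimal sketch of the feature extraction step using torchvision's MobileNetV2; the preprocessing details and the function name are assumptions, and the notebook may use a different framework.

import torch
from PIL import Image
from torchvision import models

# Minimal sketch: produce a 1000-value feature vector per image with MobileNetV2.
weights = models.MobileNet_V2_Weights.IMAGENET1K_V1
backbone = models.mobilenet_v2(weights=weights).eval()
preprocess = weights.transforms()                 # resize, crop, and normalize

def extract_feature_vector(image_path: str) -> torch.Tensor:
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)        # shape: (1, 3, 224, 224)
    with torch.no_grad():
        feature_vector = backbone(batch)          # shape: (1, 1000)
    return feature_vector.squeeze(0)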

Monitoring a satellite image classification model

Exploring the dataset

The dataset we will use is Satellite Image Classification Dataset-RSI-CB256 from Kaggle. The objective is to classify the satellite image into one of the four classes: cloudy, desert, green area, and water.
A few images from the dataset.
The dataset contains 5642 images: 1500 each for water and green areas, 1510 for cloudy, and 1132 for desert, which introduces a slight class imbalance.
The production model is an ensemble of a pre-trained computer vision model used as a feature extractor and an LGBM model trained on the extracted features to classify the satellite images. The final model achieved 98% accuracy on the test set.
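For context, here is a minimal sketch of that second stage, assuming the 1000-value feature vectors from the backbone as inputs; the stand-in data and default hyperparameters below are placeholders, and the actual training code lives in the notebook.

import numpy as np
from lightgbm import LGBMClassifier

# Hypothetical sketch of the production model's second stage: an LGBM classifier
# trained on the feature vectors produced by the pre-trained backbone.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(2000, 1000))           # stand-in 1000-value feature vectors
y_train = rng.integers(0, 4, size=2000)           # cloudy, desert, green area, water

clf = LGBMClassifier()
clf.fit(X_train, y_train)
pred_proba = clf.predict_proba(X_train[:5])       # per-class probabilities, later fed to CBPE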

Introducing an artificial shift

Original and shifted images for each class in the dataset.
The original dataset lacked a covariate shift that impacts performance, so we had to create one ourselves. In the image above, the top row shows an original image for each class, while the bottom row displays the same image after the shift. The first thing to notice is the stronger red tones in the shifted images, the result of a red color drift created by multiplying the red channel values by two. Additionally, we shifted the photos horizontally and vertically, leaving black borders on the sides.
We applied these transformations to the last 230 images, so we expect the model's performance to decrease. The question is whether our performance estimation and multivariate drift detection algorithms will detect this change.
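A minimal sketch of such a transformation, assuming NumPy and Pillow; the exact shift offsets used in the notebook are not shown in this post, so the defaults below are placeholders.

import numpy as np
from PIL import Image

def apply_artificial_shift(image: Image.Image, dx: int = 30, dy: int = 30) -> Image.Image:
    arr = np.array(image).astype(np.float32)
    # Red color drift: multiply the red channel by two and clip to the valid range
    arr[..., 0] = np.clip(arr[..., 0] * 2, 0, 255)
    # Horizontal and vertical shift, leaving black borders on two sides
    shifted = np.zeros_like(arr)
    shifted[dy:, dx:, :] = arr[: arr.shape[0] - dy, : arr.shape[1] - dx, :]
    return Image.fromarray(shifted.astype(np.uint8))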

Preparing the data

The dataset was split into three sets:
  • Training set - 2000 examples were used to train the prediction model.
  • Testing set - 1500 examples were used to test the prediction model.
  • Production set - 2142 images were used to simulate the production environment.
Flowchart illustrating how the columns of the reference/analysis sets were generated.
Then, to monitor the model in production, we needed to create two sets:
  • Reference set — contains the feature vector values extracted from the test set images, along with the model's predictions and labels. This set establishes a baseline for every metric we want to monitor.
  • Analysis set — contains the feature vector values extracted from the production set images, along with the model's predictions and (since they are available in this case) labels. The analysis set is where NannyML monitors the model's performance and covariate shift using the knowledge gained from the reference set.
The resulting sets are dataframes containing the model's predicted probability for each class, the predicted label, the true label, a timestamp, a partition column, and the 1000 feature vector values (features) obtained from the feature extraction model, which are used to detect the covariate shift.
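As a rough illustration of the schema (with random stand-in values; the column names match those used in the snippets below, and the partition column is omitted for brevity):

import numpy as np
import pandas as pd

# Hypothetical sketch of the reference-set schema; the real dataframe is built from
# the test-set feature vectors, predictions, and labels.
n_samples = 1500
feature_names = [f"feature_{i}" for i in range(1, 1001)]

reference = pd.DataFrame(np.random.rand(n_samples, 1000), columns=feature_names)
proba = np.random.dirichlet(np.ones(4), size=n_samples)          # dummy class probabilities
reference[["pred_proba_0", "pred_proba_1", "pred_proba_2", "pred_proba_3"]] = proba
reference["y_pred"] = proba.argmax(axis=1)
reference["y_true"] = np.random.randint(0, 4, size=n_samples)    # dummy ground truth
reference["timestamp"] = pd.date_range("2023-01-01", periods=n_samples, freq="D")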
To see all of the code, check out our colab notebook.

Estimating and calculating performance

import nannyml

estimator = nannyml.CBPE(
    y_pred_proba={
        0: 'pred_proba_0',
        1: 'pred_proba_1',
        2: 'pred_proba_2',
        3: 'pred_proba_3'
    },
    y_pred='y_pred',
    y_true='y_true',
    timestamp_column_name='timestamp',
    metrics=['accuracy'],
    problem_type="classification_multiclass",
    chunk_size=100  # sample size for which performance is estimated
)

estimator.fit(reference)
estimated_results = estimator.estimate(analysis)
estimated_results.plot()
To initialize the CBPE algorithm, we need to specify the names of the columns with the model's predicted probabilities, the predicted and true labels, and the timestamps, as well as the metrics to monitor, the problem type, and the chunk size (the sample of data for which performance is estimated). Then, we fit the estimator on the reference set and estimate the results for the analysis set.
The estimated performance graph.
The resulting graph shows the estimated accuracy over time for the reference and analysis periods. The accuracy stays within the given thresholds in both sets, but there are two dips in the last two chunks, where the artificial shift was introduced.
In this scenario, we have the actual labels available, so we can evaluate whether these estimates are correct.
calc = nannyml.PerformanceCalculator(
    y_pred_proba={
        0: 'pred_proba_0',
        1: 'pred_proba_1',
        2: 'pred_proba_2',
        3: 'pred_proba_3'
    },
    y_pred='y_pred',
    y_true='y_true',
    timestamp_column_name='timestamp',
    metrics=["accuracy"],
    problem_type="classification_multiclass",
    chunk_size=100
)

calc.fit(reference)
calc_results = calc.calculate(analysis)
calc_results.compare(estimated_results).plot()
The comparison plot between estimated and realized accuracy.
The plot shows a side-by-side comparison of the estimated (dark blue dotted line) and realized (solid light blue line) accuracy. We first observe that CBPE closely mirrors the actual performance. However, in the last two chunks, CBPE underestimates the full impact of the artificial shift. This is related to the problem of model miscalibration under significant covariate shift.
To give you an example, a well-calibrated model assigns a probability of 0.9 to a set of observations of which 90% actually belong to the positive class. In reality, most models are not properly calibrated. For this reason, NannyML calibrates the model's scores using the labels from the reference set. However, under a significant covariate shift, the model can become miscalibrated in production, leading to less accurate estimates.
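For a quick, self-contained look at what "well calibrated" means, here is a synthetic example using scikit-learn's calibration_curve; it is illustrative only and not part of NannyML's pipeline.

import numpy as np
from sklearn.calibration import calibration_curve

# Scores that are perfectly calibrated by construction: in each confidence bin,
# the fraction of actual positives closely matches the mean confidence.
rng = np.random.default_rng(0)
confidence = rng.uniform(0.0, 1.0, size=5000)
y_true = rng.binomial(1, confidence)

fraction_positive, mean_confidence = calibration_curve(y_true, confidence, n_bins=5)
print(np.round(mean_confidence, 2))    # ≈ [0.1, 0.3, 0.5, 0.7, 0.9]
print(np.round(fraction_positive, 2))  # approximately the same values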
Nevertheless, it is quite impressive how the algorithm is able to partially capture the impact of the covariate shift without access to the incoming labels.
Now, let’s see if the multivariate drift detection method is able to capture it.

Detecting covariate shift

mv_calc = nannyml.DataReconstructionDriftCalculator(
    column_names=features,  # [feature_1, ..., feature_1000]
    timestamp_column_name='timestamp',
    chunk_size=100
)

mv_calc.fit(reference)
mv_results = mv_calc.calculate(analysis)
mv_results.compare(calc_results).plot()
Initialization of this method is simpler than for the CBPE algorithm; we just need to specify the feature column names, the timestamp column name, and the chunk size. We fit it on the reference set, calculate on the analysis set, and get the results. This time, we compare the multivariate drift results with the realized (calculated) performance to see which drifts impacted the accuracy.
The comparison plot between the reconstruction error and realized accuracy.
The first observable issue is the high number of false alerts in the reconstruction error. This is a common challenge with drift detection methods, as only shifts to unseen or underrepresented regions actually impact performance. For example, the January data may contain some images with frozen water, triggering false alerts. However, if the model can generalize well, these shifts do not impact performance.
Nonetheless, we noticed one major shift on January 13th that slightly impacted accuracy. In this case, there may be more frozen lake images or green areas covered in snow - types of images that are likely underrepresented in the training set.
Even though the method triggers plenty of false alerts, when it's combined with performance estimation, it allows us to focus only on the alerts that decrease performance, as you can see in the last two chunks.
The last piece we need in our root cause analysis is finding which specific images are drifting.

Finding the shifting images

We know that the performance drops, and we know that the covariate shift is responsible for it, but we don't know what major change in the real-world environment caused it. The only way to identify it is by examining individual images from the shifting period. We use the previously fitted PCA algorithm to calculate the reconstruction error for each image in the shifted period and select those with the highest reconstruction errors. Let's see what they look like.
import numpy as np
import pandas as pd

shifted_period = analysis_df[-230:]
shifted_images = []

# Reuse the PCA algorithm from the previously fitted DataReconstructionDriftCalculator -> mv_calc
pca = mv_calc._pca

def calc_reconstruction_error(feature_vector):
    reshaped_feature_vector = feature_vector.values.reshape(1, -1)
    feature_vector = pd.DataFrame(reshaped_feature_vector, columns=feature_names)
    # Perform dimensionality reduction
    latent_space = pca.transform(feature_vector)
    # Perform reconstruction
    reconstructed_feature_vector = pca.inverse_transform(latent_space)
    # Calculate the reconstruction error
    reconstruction_error = np.linalg.norm(feature_vector - reconstructed_feature_vector)
    return reconstruction_error

# Iterate over the shifted period and calculate the reconstruction error for every image
for idx, row in shifted_period.iterrows():
    reconstruction_error = calc_reconstruction_error(row[feature_names])
    shifted_images.append({
        "Path": analysis_data_path["path"].iloc[idx],
        "Reconstruction error": reconstruction_error
    })

# Get the top-4 drifting images
pd.DataFrame(shifted_images).sort_values(by="Reconstruction error", ascending=False).head(4)
Non-shifted and shifted images for each class in the dataset with their reconstruction error.
The top row in the image shows four random non-shifted images, each representing a different label from the analysis set, along with their reconstruction errors. The bottom row presents the top four shifted images, selected based on the highest reconstruction errors. The shifted images have significantly higher reconstruction errors compared to regular images.
In our case, the extra red tones and black borders on the images might suggest a damaged camera, which would lead a data scientist to look for a working camera on the satellite instead. Since our shift is artificial, there isn't much to reason about; however, in real life, these kinds of changes can be more subtle and require a closer look.

Conclusions

Monitoring performance without labels and detecting covariate shift are among the foremost challenges in computer vision use cases. In this blog, we explored a monitoring system that addresses these problems using two NannyML algorithms: CBPE for performance estimation and multivariate drift detection to identify covariate shift in images. To complete the root cause analysis, we also learned how to find the drifting images. Now you're equipped with the tools and knowledge to build a robust monitoring system for your computer vision application.
If you are curious about how all of this works under the hood, or if you would like to contribute to the project, please check out github.com/nannyml.

Written by

Maciej Balawejder

Junior Data Scientist at NannyML