Prevent Failure of Product Defect Detection Models: A Post-Deployment Guide

This blog dissects the core challenge of monitoring defect detection models: the censored confusion matrix. Additionally, I explore how business value metrics can help you articulate the financial impact of your ML models in front of non-data science experts.

The manufacturing sector has high hopes for artificial intelligence (AI). According to MarketsandMarkets, predictive maintenance and product defect detection applications are expected to account for the largest share of AI adoption in the US market.
As data scientists, our job doesn't end when the model is deployed. In fact, the fun is just about to begin. AI models can lose up to 20% of their value within the first six months if not actively monitored and managed. Continuous monitoring is essential to realizing the full benefits of the work we did pre-deployment.
The following sections cover how defect detection models can go haywire once deployed, using a real-world dataset. I will discuss actionable steps you can follow to keep your models reliable in a dynamic production environment.

Monitoring is what's missing from your ML lifecycle.

Product defect detection models are designed to identify product flaws during manufacturing. These models are deployed to replace manual inspection and save time and cost. They help maintain quality standards and reduce waste, thus improving overall efficiency.
If a product defect model does not function correctly, it can produce false positives or false negatives, both of which have significant consequences. A false positive, for instance, means a non-defective product is unnecessarily discarded or sent for rework, increasing production costs.
Monitoring helps catch these failures early by continuously evaluating the model's performance and making necessary adjustments to prevent issues from escalating.
To evaluate the performance of such a classification model, we need ground truth. Ground truth, or the actual state of the product (defective or not), is used to calculate various metrics to evaluate performance.
In some scenarios, ground truth may never be obtained. Not every product can be manually inspected in high-volume manufacturing environments due to time and cost constraints. Some defective products might slip through the inspection process and reach the customer. When this happens, the defect may only be discovered much later, if at all. Unidentified defects can interrupt downstream processes or end up in customers' hands, opening a Pandora's box of recalls and legal liabilities. This situation is especially problematic because the true defect rate of the products remains unknown, and the model's performance metrics may be based on incomplete or biased data.
As a result, the model might seem to perform well on paper while failing in practical applications. The absence of ground truth makes the confusion matrix censored.
Censored Confusion Matrix
💡 The censored confusion matrix is a problem in other machine learning models that are also applied in manufacturing settings. To learn more, check out my blog on monitoring predictive maintenance.

A Guide to Better Monitoring Models

Post-deployment data science involves maintaining, monitoring, and improving machine learning models after they have been deployed into a production environment. The NannyML monitoring workflow involves continuously monitoring performance until we spot a drop, performing root cause analysis to identify why the performance dropped, and finally resolving the issue.

Pre-Deployment

We will be using the comprehensive Leather Defect Detection Dataset from Kaggle.
This dataset includes 3600 labelled images of leather with various defects such as pinholes, folding marks, and loose grain.
The pre-deployment steps of model building and validation remain the standard ML lifecycle steps.
 
Classes in the dataset
I injected covariate shift in the last 600 images by adjusting the red channel and translating the images.
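A rough sketch of this kind of perturbation is shown below, assuming each image is an RGB NumPy array; the red boost and translation offsets here are illustrative values, not necessarily the exact ones used.
import numpy as np
from scipy.ndimage import shift as translate

def inject_covariate_shift(image, red_boost=30, offset=(10, 15)):
    # Strengthen the red channel, then clip back into the valid pixel range
    shifted = image.astype(np.int16)
    shifted[..., 0] += red_boost
    shifted = np.clip(shifted, 0, 255).astype(np.uint8)
    # Translate the image by a few pixels along both spatial axes
    return translate(shifted, shift=(offset[0], offset[1], 0), order=0, mode='nearest')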
Next, I used the pre-trained MobileNetV2 model to extract feature vectors from the images. I created a DataFrame to store these feature vectors along with their labels. After preprocessing the images, I added a timestamp column to create a temporal component in the dataset. The injected covariate shift will be reflected in the features. I did this to simulate a production-like environment where the newer images drift compared to the training dataset.
An LGBM classifier was trained, achieving a training accuracy of 98%.
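A minimal sketch of this feature-extraction and training step, assuming the images and their labels are already loaded as arrays (names like images, labels, and extract_features are illustrative):
import pandas as pd
import tensorflow as tf
from lightgbm import LGBMClassifier

# Pre-trained MobileNetV2 without its classification head acts as a feature extractor
backbone = tf.keras.applications.MobileNetV2(include_top=False, pooling='avg', input_shape=(224, 224, 3))

def extract_features(images):
    x = tf.keras.applications.mobilenet_v2.preprocess_input(images.astype('float32'))
    return backbone.predict(x, verbose=0)  # one 1280-dimensional vector per image

features = extract_features(images)
df = pd.DataFrame(features, columns=[f'feature_{i}' for i in range(features.shape[1])])
df['y_true'] = labels
# Synthetic timestamps give the dataset a temporal component
df['timestamp'] = pd.date_range('2024-01-01', periods=len(df), freq='h')

clf = LGBMClassifier()
clf.fit(df.filter(like='feature_'), df['y_true'])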

Post-Deployment

Since our confusion matrix is censored, we will use the Confidence-based Performance Estimation (CBPE) algorithm to estimate its values and, thus, the model's performance.
A multiclass classification model provides predictions and confidence scores for all classes. CBPE treats each defect type as a binary classification problem (e.g., "pinhole" vs "not pinhole") and uses the corresponding probability scores to evaluate the model's performance for each defect type. By calculating accuracy and other metrics for each defect type and then combining these, CBPE provides an overall estimation of the model's performance across all defect types.
import nannyml as nml

# Map each class label to the column holding its predicted probability
estimator = nml.CBPE(
    y_pred_proba={
        0: 'pred_proba_0',
        1: 'pred_proba_1',
        2: 'pred_proba_2',
        3: 'pred_proba_3',
        4: 'pred_proba_4',
        5: 'pred_proba_5'
    },
    y_pred='y_pred',
    y_true='y_true',
    timestamp_column_name='timestamp',
    metrics=['accuracy'],
    problem_type='classification_multiclass',
)

# Fit on the reference period, then estimate performance for the analysis period
estimator.fit(reference_df)
estimated_results = estimator.estimate(analysis_df)
estimated_results.plot().show()
 
Estimated Performance by CBPE
 
The reference period is derived from the pre-deployment data, generally the test set. As models age, this period should be chosen from a desired benchmark dataset. The performance of the analysis period (which reflects the current post-deployment scenario) is compared to the reference set, and we can see a major performance decline (remember, I injected covariate shift into the last 600 images).
If you do not have ML monitoring set up, you will not find out about this performance decline until your business has already taken a hit and trust in data-driven decisions is gone.
 
Plot comparing realised and estimated performance
 
The CBPE estimations (dotted line) successfully track the general downward trend of the realized accuracy (solid line). Having CBPE's estimations act as a proxy for realized accuracy can be a lifesaver.
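One way to produce a comparison like the one above is to compute the realized accuracy with nannyml's PerformanceCalculator wherever ground truth is available. A minimal sketch, assuming a hypothetical analysis_with_labels_df that contains the recovered y_true column:
# Realized performance on chunks where ground truth is available
calculator = nml.PerformanceCalculator(
    y_pred_proba={
        0: 'pred_proba_0', 1: 'pred_proba_1', 2: 'pred_proba_2',
        3: 'pred_proba_3', 4: 'pred_proba_4', 5: 'pred_proba_5'
    },
    y_pred='y_pred',
    y_true='y_true',
    timestamp_column_name='timestamp',
    metrics=['accuracy'],
    problem_type='classification_multiclass',
)
calculator.fit(reference_df)
realized_results = calculator.calculate(analysis_with_labels_df)
realized_results.plot().show()
In recent nannyml releases, the estimated and realized result sets can also be overlaid in a single figure with estimated_results.compare(realized_results).plot().show().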
Now let's discuss how you can explain the impact of model performance on business outcomes when ground truth labels are absent.

Business Value Metric and Estimations

The business_value metric provides a way to tie the performance of a model to business-related KPIs.
It is formulated as a weighted sum over the confusion matrix: each cell count is multiplied by the business value assigned to that outcome. In the binary case, business_value = value_TP·TP + value_TN·TN + value_FP·FP + value_FN·FN.
Let's walk through a small binary classification example to understand how this metric is calculated. Say we are trying to automate the entire quality inspection pipeline by replacing manual inspectors with ML models.
For each piece classified correctly, it saves us $1 from manual inspection, and we assume the following costs for misclassification:
  • False Negative: Loss of $7 (due to damaged reputation, replacement costs, etc.).
  • False Positive: Loss of $3 (due to manufacturing costs and lost profit).
The video below shows a step-by-step derivation of the business value metric.
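To make the arithmetic concrete, here is a small worked example with hypothetical confusion-matrix counts for a batch of 1,000 inspected pieces (the counts are made up for illustration):
# Hypothetical counts for one batch of 1,000 pieces
tp, tn, fp, fn = 50, 900, 30, 20

# Dollar value attached to each outcome
value_tp, value_tn = 1, 1      # $1 saved per correctly classified piece
value_fp, value_fn = -3, -7    # losses from misclassification

business_value = tp * value_tp + tn * value_tn + fp * value_fp + fn * value_fn
print(business_value)  # 50 + 900 - 90 - 140 = 720 -> $720 for this batch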
This method helps data scientists articulate their models' financial benefits to non-data science experts. The metric can easily be adapted to other custom business KPIs as well: you simply relate the KPI to each outcome of the confusion matrix. Even if the confusion matrix is censored, you can use the CBPE algorithm to estimate the business value metric.
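For a binary setup like the inspection example above, CBPE can estimate business_value directly. A minimal sketch, with illustrative column names and the [[value_TN, value_FP], [value_FN, value_TP]] layout for the value matrix:
bv_estimator = nml.CBPE(
    y_pred_proba='pred_proba',
    y_pred='y_pred',
    y_true='y_true',
    timestamp_column_name='timestamp',
    metrics=['business_value'],
    business_value_matrix=[[1, -3],    # value of TN, value of FP
                           [-7, 1]],   # value of FN, value of TP
    problem_type='classification_binary',
)
bv_estimator.fit(reference_df)
bv_estimator.estimate(analysis_df).plot().show()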

Root Cause Analysis

The observed decline in accuracy in the analysis period can stem from various factors such as covariate shift, concept drift, or data quality issues. By identifying the root cause, we can apply targeted solutions to improve model performance.
Covariate Shift
Covariate shift is a type of data drift where the distribution of input features changes.
For a leather defect dataset, a covariate shift could occur if the images are taken under different lighting conditions, if the leather is photographed at various angles, or if the leather texture has variations not present in the original training data.
We can detect this type of shift using univariate and multivariate drift detection.
# Multivariate drift detection via PCA reconstruction error
mv_calc = nml.DataReconstructionDriftCalculator(
    column_names=feature_names,
    chunk_size=200,
)
mv_calc.fit(reference_df)
mv_results = mv_calc.calculate(analysis_df)
mv_results.plot().show()
PCA Reconstruction Error method detects the artificial covariate shift that was injected
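Alongside the multivariate reconstruction error, a univariate check can point to which individual features drifted. A minimal sketch, reusing the same feature_names list (the choice of the Jensen-Shannon method and the columns plotted are illustrative):
uv_calc = nml.UnivariateDriftCalculator(
    column_names=feature_names,
    timestamp_column_name='timestamp',
    continuous_methods=['jensen_shannon'],
    chunk_size=200,
)
uv_calc.fit(reference_df)
uv_results = uv_calc.calculate(analysis_df)
# Plot drift for a few features of interest
uv_results.filter(column_names=feature_names[:3]).plot().show()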
How to fix it:
  1. Modify existing features or create new ones to better align with the current data distribution. For example, if lighting changes affect the images, applying pre-processing techniques to normalize the lighting conditions can mitigate the impact.
  2. In the Retraining is not all you need blog, Miles advises not to jump straight to retraining your model in the case of a covariate shift; adjusting prediction thresholds might be the better option.
Concept Drift
Concept drift occurs when the model’s assumptions about the data no longer hold true. A product's lifecycle can trigger a concept shift. As products age or as manufacturing techniques improve, the types of defects or issues that arise can change.
The Reverse Concept Drift algorithm is part of the NannyML Cloud algorithm family and was developed to quantify the impact of concept shift on model performance. Using it, you can automate your retraining pipeline to trigger whenever the shift is strong enough to negatively affect your model.
How to fix it:
  1. When concept drift is detected, retrain the model with data that reflects the new relationship between the input features and the output. This involves collecting new labelled data under the updated defect criteria and updating the model to learn the new patterns.
  2. In some cases, concept drift might require changes to the model architecture. For example, if new types of defects emerge that were not previously considered, expanding the number of classes to recognize these new defect types might be necessary.
Data Quality
Changes to the data collection process are a common cause of data quality issues in manufacturing.
Suppose you notice that the model's performance has degraded over time. Upon investigation, you find that a new camera system was installed, leading to images with different lighting conditions than the training data.
How to fix it:
  1. Apply image preprocessing techniques to standardize the lighting conditions of all images. This could involve techniques like histogram equalization or a consistent lighting normalization algorithm, as sketched after this list.
  2. Coordinate with peers responsible for upstream processes so that you are notified about any change in the data pipeline.
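A minimal sketch of such a normalization step, assuming OpenCV and BGR images (the function name is illustrative):
import cv2

def normalize_lighting(image_bgr):
    # Equalize only the luminance channel so colour information is preserved
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)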

Conclusion

This blog dissected the core challenge of monitoring defect detection models: the censored confusion matrix. We thoroughly examined the first two phases of the ML monitoring workflow—performance monitoring and root cause analysis. Additionally, we explored how business value metrics can help you articulate the financial impact of your ML models in front of non-data science experts.
NannyML provides a comprehensive toolkit for post-deployment monitoring. Our cloud product connects business needs with ML production performance.
Schedule a demo with our founders to learn how our solutions can fit your needs!


Written by

Kavita Rana

Data Science Content Intern at NannyML