Harnessing the Power of AWS SageMaker & NannyML PART 1: Training and Deploying an XGBoost Model

A walkthrough on how to train, deploy and continuously monitor ML models using NannyML and AWS SageMaker.

In today's rapidly evolving world of machine learning, having the right tools and platforms is a prerequisite to staying ahead of the curve. While AWS SageMaker has emerged as a robust platform for training and deploying machine learning models at scale, there's an increasing need to go beyond just training and deploying models. Making sure these models maintain their performance over time is a pressing concern for many data scientists today. Recognizing this challenge, NannyML provides an efficient solution for continuous monitoring, ensuring that models keep performing optimally in the real world.
In this two-part blog series, we're diving deep into the heart of these technologies, aimed at providing you with a comprehensive walkthrough. By the end of this series, you'll be proficient in training, deploying, and continuously monitoring ML models using the synergistic combination of AWS SageMaker and NannyML.
  • Part 1 of our series, which you're currently reading, focuses on the nuances of training and deploying an XGBoost model using AWS SageMaker. Whether you're an ardent user of SageMaker or just getting your feet wet, this guide will simplify the steps, making it straightforward and digestible.
  • In the upcoming Part 2, we'll dive deep into monitoring. Through the lens of NannyML's SageMaker Algorithms, we will uncover the nuances of ad hoc model monitoring and the more sophisticated domain of continuous model monitoring.
To make this experience as hands-on as possible, we'll be working with the popular California House Pricing Dataset. And because we know the importance of practicality, all code and data referenced in these posts are available on our GitHub repository.
So, without further ado, let's embark on this journey and harness the full power of AWS SageMaker and NannyML.

Setting up

We'll begin by setting up a SageMaker notebook instance. A small instance like the ml.t3.medium is sufficient for this task. We'll inspect the model file produced during the training phase as we walk through the process. To do this, ensure you have the XGBoost package installed:
!pip install xgboost
Next, set up your environment and import the necessary libraries:
# Import necessary libraries and modules
import pandas as pd
import matplotlib.pyplot as plt
import sagemaker, boto3, tarfile, xgboost
import sklearn.metrics
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

# Set up the SageMaker session, region, and execution role
sess = sagemaker.Session()
region = sess.boto_region_name
role = sagemaker.get_execution_role()

# Define S3 bucket details
bucket_name = 'demo-nannyml'
prefix = 'california-housing'

Fetching the data

The dataset we'll be using is a variation of the widely known California House Pricing Dataset. The objective is to predict whether a house is expensive, based on historical median house price. The dataset has already been split into training, testing, and production sets. We will send the production data to the model after it has been deployed. If you want to know how this data was created, you can find a notebook here.
# Load the data
data_path = 'data/california-housing-dataset/{}.csv'
train = pd.read_csv(data_path.format('train'))
test = pd.read_csv(data_path.format('test'))
prod = pd.read_csv(data_path.format('prod'))

# Observe the data
train.sample(5)
Now, it's time to store the training data on S3:
# Create a new S3 bucket for storing data
s3 = boto3.client('s3')
s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})

# Define paths for the train, test, and prod data in S3
s3_train_path = f's3://{bucket_name}/{prefix}/train/'
s3_train_csv_path = s3_train_path + 'train.csv'
s3_test_csv_path = f's3://{bucket_name}/{prefix}/test.csv'
s3_prod_csv_path = f's3://{bucket_name}/{prefix}/prod.csv'

# Upload the data to the S3 bucket
train.to_csv(s3_train_csv_path, index=False)
test.to_csv(s3_test_csv_path, index=False)
prod.to_csv(s3_prod_csv_path, index=False)
Keep in mind that for SageMaker's XGBoost image, the training data should be formatted as follows:
  • The target variable should be the first column (see the sketch below).
  • Instead of pointing directly to the file path (s3_train_csv_path), reference the folder containing the training data (s3_train_path).
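A quick sanity check along these lines can help before uploading; it assumes the target column is named Target, as in our dataset:
# Sanity check: SageMaker's XGBoost expects the target in the first column
target_col = 'Target'
if train.columns[0] != target_col:
    # Reorder so the target comes first
    train = train[[target_col] + [c for c in train.columns if c != target_col]]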

Training an XGBoost model on AWS SageMaker

For training, we'll utilize SageMaker's built-in XGBoost algorithm. Instead of manually locating the image URI, SageMaker conveniently constructs an XGBoost container tailored to our specifications:
# Retrieve the image URI for XGBoost in the current region
image_uri = sagemaker.image_uris.retrieve('xgboost', region, '1.7-1')
With the XGBoost image URI in hand, you can now configure the SageMaker estimator and set model hyperparameters:
# Define the XGBoost estimator with specified parameters
xgb = sagemaker.estimator.Estimator(
    image_uri,
    output_path=f's3://{bucket_name}/{prefix}/training_result',
    instance_count=1,
    instance_type='ml.m4.xlarge',
    sagemaker_session=sess,
    role=role,
)

# Set some hyperparameters for the XGBoost model
xgb.set_hyperparameters(
    objective='binary:logistic',
    num_round=100,
    max_depth=3,
)
For those keen on optimizing the model further, SageMaker offers the sagemaker.tuner.HyperparameterTuner. By providing our estimator object, we can undertake hyperparameter tuning.
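As a rough sketch of how that could look — the metric, parameter ranges, and the validation_input channel below are illustrative assumptions, not part of this walkthrough:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

# Illustrative search space; adjust the ranges to your problem
hyperparameter_ranges = {
    'max_depth': IntegerParameter(2, 8),
    'eta': ContinuousParameter(0.01, 0.3),
}

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name='validation:auc',  # assumes a validation channel and an AUC eval metric
    hyperparameter_ranges=hyperparameter_ranges,
    objective_type='Maximize',
    max_jobs=10,
    max_parallel_jobs=2,
)

# validation_input would be a TrainingInput pointing to a held-out set
# tuner.fit({'train': training_input, 'validation': validation_input})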
Considering the XGBoost algorithm's flexibility in handling diverse data formats (csv/parquet etc.), we need to specify the dataset configuration via the TrainingInput object. Once set up, we can initiate the training job:
# Create the training data location
training_input = TrainingInput(
    s3_data=s3_train_path, content_type='csv'
)

# Train the model using the training data
xgb.fit({'train': training_input})
For documentation purposes, it's helpful to note down the location where the model file resides:
# Get the path of the trained model artifact in S3
s3_model_file_path = xgb.latest_training_job.describe()['ModelArtifacts']['S3ModelArtifacts']
print(s3_model_file_path)

Analyzing the model

Understanding feature importance aids in deciphering the model's decision-making process. We can retrieve the model from S3 and use the XGBoost package to load and analyze it. (Note: for built-in XGBoost versions below 1.3-1, a different approach involving unpickling is necessary; see the sketch after the code below.)
# Download the trained model artifact to the local environment
s3.download_file(bucket_name, s3_model_file_path.split(bucket_name + '/')[1], 'model.tar.gz')

# Extract the model artifact
with tarfile.open('model.tar.gz') as tar:
    tar.extractall()

# Load the model and set its feature names
xgb_model = xgboost.Booster()
xgb_model.load_model('xgboost-model')
xgb_model.feature_names = list(train.columns[1:])

# Plot the feature importance of the trained model
fig, ax = plt.subplots()
xgboost.plot_importance(xgb_model, ax=ax)
plt.show()
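As mentioned above, for built-in XGBoost versions below 1.3-1 the extracted artifact is a pickled Booster rather than a file that load_model can read. A minimal sketch for that case, assuming the extracted file is still named xgboost-model:
import pickle as pkl

# Older built-in XGBoost images (< 1.3-1) store the model as a pickled Booster
with open('xgboost-model', 'rb') as f:
    xgb_model = pkl.load(f)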

Evaluating the model

Let's check the model's performance on the test set:
# Get predictions for the test dataset
test_scores = xgb_model.predict(xgboost.DMatrix(test.iloc[:,1:]))

# Measure ROC AUC
print('Test ROC AUC Score = ', round(sklearn.metrics.roc_auc_score(y_true=test['Target'], y_score=test_scores),2))
Test ROC AUC Score = 0.78
Subsequent model monitoring with NannyML requires a reference dataset, and the test set is an ideal candidate. Let's add our model's predictions and upload it to S3.
# Assign prediction scores and predictions to the test and write to S3
test['prediction_score'] = test_scores
test['prediction'] = test_scores > 0.5
s3_reference_csv_path = f's3://{bucket_name}/{prefix}/reference.csv'
test.to_csv(s3_reference_csv_path, index=False)

Deploying the model

We can deploy the model as an endpoint with a simple command:
# Deploy the trained model as an endpoint
xgb_predictor = xgb.deploy(
    endpoint_name=prefix + '-model',
    initial_instance_count=1, 
    instance_type="ml.m4.xlarge", 
    serializer=CSVSerializer()
)
Once the deployment finishes, the endpoint appears in the SageMaker console under Inference → Endpoints.

Testing the endpoint

Let's send a few test observations and compare the predictions to ensure our endpoint is functioning correctly. We have to exclude the first column (the target) and the last two columns (the model outputs we just added to the reference set):
# Make predictions using the deployed endpoint
end_point_output = xgb_predictor.predict(test.iloc[:5,1:-2], initial_args={'ContentType': 'text/csv'})

# Compare end-point predictions with test_scores
all(test_scores[:5] == [float(i) for i in end_point_output.decode().split()])
True
Our endpoint seems to be producing consistent predictions.
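The predictor object is handy inside the notebook, but other applications can call the endpoint through the SageMaker runtime API as well. Here's a minimal sketch using boto3, reusing the endpoint name we chose at deploy time:
# Invoke the endpoint directly via the SageMaker runtime API
runtime = boto3.client('sagemaker-runtime')
payload = test.iloc[:5, 1:-2].to_csv(header=False, index=False)

response = runtime.invoke_endpoint(
    EndpointName=prefix + '-model',
    ContentType='text/csv',
    Body=payload,
)
print(response['Body'].read().decode())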

Cleaning up

As this is a demonstration, cleaning up resources is good practice to avoid unnecessary charges.
# Delete the deployed endpoint to prevent further charges
xgb_predictor.delete_endpoint()
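If you also want to remove the model object and the demo data from S3, a sketch along these lines should do it. Note that it deletes everything under the california-housing prefix, so only run it if you no longer need those files:
# Delete the SageMaker model created during deployment
xgb_predictor.delete_model()

# Remove the uploaded demo data, keeping the bucket itself
boto3.resource('s3').Bucket(bucket_name).objects.filter(Prefix=prefix).delete()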

Conclusion

In this walkthrough, we journeyed through a comprehensive process of building, evaluating, and deploying a machine-learning model using AWS SageMaker. Starting with setting up our environment, we fetched the California House Pricing Dataset to predict if a house would be deemed expensive. Leveraging SageMaker's capabilities, we trained our model, visualized feature importance, and evaluated its performance.
With SageMaker's seamless integration with other AWS services, we demonstrated how effortless it is to deploy our trained model as an endpoint, making it accessible for real-time predictions. A simple test confirmed the efficacy of our deployment.
Machine learning on the cloud, particularly with platforms like AWS SageMaker, brings efficiency, scalability, and robustness. For those looking to streamline their machine learning pipelines, diving deeper into SageMaker’s plethora of features would undoubtedly be beneficial.
Stay tuned for part 2, where we continuously monitor our deployed model using the NannyML SageMaker Algorithm! If you want to go ahead and piece together continuous monitoring yourself, here is a good starting point.
 
 
 


Written by

Wiljan Cools, Co-Founder at NannyML