In this tutorial, we'll step into the shoes of a data scientist to develop a machine learning solution for two hotels in Portugal. Our objective is to deploy a monitoring algorithm for a model that determines whether a room reservation will be canceled or not. This series is split into two parts:
- Part 1: Creating the Dataset - we will go through the process of training the model and preparing the dataset for monitoring in production.
- Part 2: Performance Estimation in SageMaker - we will simulate the production environment and deploy NannyML’s algorithm to estimate the performance without the ground truth.
Let’s get into it!
Hotel booking dataset
The dataset is available on Kaggle, and it contains 119,390 hotel bookings with 32 features. Our target is a binary value indicating whether the booking was canceled or not.
Not knowing whether customers will cancel their bookings is a common problem for hotels and travel companies. Being able to predict cancellations accurately allows them to improve room allocation and optimize their revenue.
As mentioned before, the data comes from two hotels: the Resort Hotel in the Algarve (the southern region of Portugal) and the City Hotel in Lisbon. The arrival dates of the bookings range from July 2015 to August 2017. If you want to know more about the dataset, descriptions of all the features are available in the dataset's paper.
Notebook Walkthrough
In this section, we will go through the whole process of the ML model lifecycle, starting from raw data and ending with analysis and reference sets for NannyML algorithms.
Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
from lightgbm import LGBMClassifier, plot_importance
from sklearn.metrics import roc_auc_score
Loading the data
The dataset is downloaded locally from Kaggle in CSV format and then loaded with pandas.

data = pd.read_csv("hotel_bookings.csv")
data.shape
# (119390, 32)
Preprocessing the data
Before modeling, we will do some preprocessing:
- Adding timestamp column
The timestamp column tells us when each data point happened. According to the paper, for each booking we use a date that is one day before the actual arrival date. In our dataset, the arrival date is split into three parts: `arrival_date_year`, `arrival_date_month`, and `arrival_date_day_of_month`. In the code below, we combine these parts to create the arrival date and then subtract one day from it.

data['timestamp'] = data['arrival_date_year'].astype(str) + "-" + data['arrival_date_month'].astype(str) + "-" + data['arrival_date_day_of_month'].astype(str)
data['timestamp'] = pd.to_datetime(data['timestamp']) - timedelta(days=1)
data.sort_values(by='timestamp', inplace=True)
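As a quick sanity check (this snippet isn't from the original notebook), the resulting timestamps should roughly span the arrival-date range mentioned earlier, July 2015 to August 2017:

# earliest and latest timestamps; expected to span roughly July 2015 to August 2017
print(data['timestamp'].min(), data['timestamp'].max())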
- Picking data that wasn't canceled more than 24h before arrival
We're trying to predict whether a booking gets canceled within the 24 hours before arrival. However, our data includes cancellations that occurred well before this 24-hour window, so we keep only the bookings whose final status change happened inside it.

The `reservation_status_date` column tracks when a booking's status last changed to cancellation, check-out (meaning the person stayed at the hotel), or no-show (the person didn't show up at the hotel). Since check-out and no-show can't happen before the arrival date, we can simply keep the rows where `reservation_status_date` is greater than or equal to the `timestamp`.

# parse the status-change dates so they can be compared with the datetime timestamp column
data['reservation_status_date'] = pd.to_datetime(data['reservation_status_date'])
data = data[data['reservation_status_date'] >= data['timestamp']]
- Dropping columns with data leakage
The `reservation_status` column shows the last booking status: cancellation, check-out, or no-show. This information directly reveals whether the booking was canceled, which is a form of data leakage. Moreover, if the model were in production, this information wouldn't be available when the booking is placed, only after the customer cancels, checks out, or doesn't show up. Therefore, we must remove this column, along with `reservation_status_date` and `arrival_date_year`.

data = data.drop(columns=['reservation_status', 'reservation_status_date', 'arrival_date_year'])
- Creating partition column
When training an ML model, we often split the data into two (train, test) or three (train, validation, test) sets. But since the final goal of this tutorial is to monitor an ML model on unseen production data, we will split the original data into three parts:
- train: data from 2015 until April 2016
- test: data from May 2016 to September 2016
- production: data from October 2016 to August 2017
conditions = [
    # training: all of 2015 through April 2016
    (data['timestamp'].dt.year.isin([2015])) | ((data['timestamp'].dt.year == 2016) & (data['timestamp'].dt.month < 5)),
    # testing: May 2016 through September 2016
    (data['timestamp'].dt.year == 2016) & ((data['timestamp'].dt.month >= 5) & (data['timestamp'].dt.month <= 9)),
    # production: October 2016 through August 2017
    ((data['timestamp'].dt.year == 2016) & (data['timestamp'].dt.month > 9)) | (data['timestamp'].dt.year == 2017)
]
partitions = ['training', 'testing', 'production']
data['partition'] = np.select(conditions, partitions)
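As a quick check (this snippet isn't from the original post), we can confirm how the rows are distributed across the three partitions. Note that `np.select` assigns the default value 0 to any row that matches none of the conditions, so a '0' showing up here would signal a gap in the date conditions:

# row counts per partition; a value of 0 would mean some rows matched no condition
print(data['partition'].value_counts())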
- Correcting variable types
To make the LGBM model and the NannyML algorithms work correctly, we need to use the `category` dtype for categorical variables instead of `object`.

categorical_cols = ['hotel', 'meal', 'country', 'market_segment',
                    'distribution_channel', 'reserved_room_type',
                    'assigned_room_type', 'deposit_type',
                    'customer_type', 'arrival_date_month']
data[categorical_cols] = data[categorical_cols].astype('category')
Splitting the data
The production dataset will help us simulate a real-world scenario where a trained model is used in a production environment. Typically, production data doesn't contain targets, which is why monitoring performance in production is a challenging task.
train = data[data["partition"] == "training"]
X_train = train.drop(columns=["is_canceled", "timestamp", "partition"])
y_train = train["is_canceled"]
test = data[data["partition"] == "testing"]
X_test = test.drop(columns=["is_canceled", "timestamp", "partition"])
y_test = test["is_canceled"]
prod = data[data["partition"] == "production"]
X_prod = prod.drop(columns=["is_canceled", "timestamp", "partition"])
y_prod = prod["is_canceled"]
Observing the targets
The target distribution is highly imbalanced, which makes sense because most people don't cancel their bookings within 24 hours of arrival: only about 6% of the bookings in our filtered data end up canceled. We also map the numerical labels to `No` and `Yes` for better readability.

train['is_canceled'].replace({0: 'No', 1: 'Yes'}).value_counts(normalize=True).round(2).to_frame("target_distribution")
Building the model
We're using an LGBMClassifier with an extra parameter called `class_weight` to address the target imbalance. We've also set `max_depth` to 2 to help prevent overfitting.

model = LGBMClassifier(random_state=111, class_weight="balanced", max_depth=2)
model.fit(X_train, y_train)
Understanding the model
Let’s start with plotting the feature importance graph.
plot_importance(model, max_num_features=5)
The two features with the biggest impact on the model's predictions are `country` and `lead_time`. Let's take a closer look and reason about why.

train["country"].value_counts(normalize=True).round(2).to_frame("Countries distribution")[:5]
There are 177 different countries in the dataset, with the top 5 being Portugal, UK, France, Spain, and Germany. Portugal stands out as the most dominant country, which makes sense given that the hotels are located there. This could also mean that local residents may have different cancellation behavior compared to other visitors.
filtered_data = train[['country', 'is_canceled']].copy()  # copy to avoid SettingWithCopyWarning
filtered_data['country_group'] = 'Others'
filtered_data.loc[filtered_data['country'] == 'PRT', 'country_group'] = 'Portugal'
ax = sns.countplot(x='country_group', hue='is_canceled', data=filtered_data)
ax.set_xlabel('Country')
ax.set_ylabel('Bookings')
ax.set_title('Booking Cancellation by Country')
plt.savefig("portugal_and_others.png")
plt.show()
The graph illustrates booking cancellations for Portugal and other countries. It's evident that the distribution of cancellations among local residents is more balanced than for other countries: travelers from abroad usually show up at the hotel and rarely cancel.
One possible reason is that canceling might also mean canceling flights, trains, and other travel arrangements, which makes international guests more committed.
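To put a rough number on this, we can compare the cancellation rates of the two groups directly (a small sketch reusing the `filtered_data` frame from above; it isn't part of the original notebook):

# share of canceled bookings per country group
print(filtered_data.groupby('country_group')['is_canceled'].mean().round(2))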
Now, let's check out the `lead_time` feature.

# Filter the data based on 'is_canceled' values
canceled_0 = train[train['is_canceled'] == 0]['lead_time']
canceled_1 = train[train['is_canceled'] == 1]['lead_time']
# Create subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))
# Plot histogram for 'is_canceled' == 0
axes[0].hist(canceled_0, bins=20, color='green', alpha=0.7)
axes[0].set_title('Lead Time Histogram for Completed Bookings')
axes[0].set_xlabel('Lead Time')
axes[0].set_ylabel('Bookings')
# Plot histogram for 'is_canceled' == 1
axes[1].hist(canceled_1, bins=20, color='red', alpha=0.7)
axes[1].set_title('Lead Time Histogram for Canceled Bookings')
axes[1].set_xlabel('Lead Time')
axes[1].set_ylabel('Bookings')
plt.tight_layout()
plt.show()
The `lead_time` feature shows the number of days between the booking date and the arrival date. The first plot is a histogram of lead times for completed bookings: most bookings are made around 50 days before arrival. Interestingly, most canceled bookings are also made within 50 days of arrival. It's worth noting that there are no canceled bookings made more than 300 days before arrival, which is understandable: people planning a trip a year in advance typically have everything well scheduled and planned out.

Making predictions
y_train_pred_proba = model.predict_proba(X_train)[:, 1]
y_test_pred = model.predict(X_test)
y_test_pred_proba = model.predict_proba(X_test)[:, 1]
Evaluating the model
print('Training ROC AUC : ', roc_auc_score(y_train, y_train_pred_proba))
# Training ROC AUC : 0.89
print('Testing ROC AUC : ', roc_auc_score(y_test, y_test_pred_proba))
# Testing ROC AUC : 0.81
A test ROC AUC of 0.81 is decent. However, the gap between the training and testing results suggests that the model might be overfitting, or that the characteristics of the test set differ from the training data.
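One way to probe whether the gap comes from time-dependent shifts is to score the test set month by month. The sketch below is a diagnostic we're adding on top of the original notebook; it reuses the `test`, `y_test`, and `y_test_pred_proba` objects defined earlier:

# monthly ROC AUC on the test set; a downward trend would hint at drift
test_scores = test[['timestamp']].copy()
test_scores['y_true'] = y_test
test_scores['y_proba'] = y_test_pred_proba
monthly_auc = test_scores.groupby(test_scores['timestamp'].dt.to_period('M')).apply(
    lambda g: roc_auc_score(g['y_true'], g['y_proba'])  # raises if a month contains only one class
)
print(monthly_auc.round(3))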
Deploying the model
The next step after understanding the model is deploying it to production. To simulate that environment, we will use the model to make predictions on unseen production data.
y_prod_pred = model.predict(X_prod)
y_prod_pred_proba = model.predict_proba(X_prod)[:, 1]
Creating reference and analysis set
We need to create a reference and analysis set to properly analyze the model performance in production. Here is a quick recap of what both of these sets are:
- Reference dataset: NannyML uses the reference set to establish a baseline for model performance and drift detection. Typically it's the test set, and it needs to include the features, the monitored model's outputs, and the ground truth.
- Analysis dataset: the latest production data. On that period, NannyML checks whether the model maintains its performance and whether the feature distributions have shifted. Since the performance can be estimated, it doesn't need to include ground truth values.
reference = X_test.copy() # using the test set as a reference
reference['is_canceled'] = test['is_canceled'] # ground truth values
reference['y_pred'] = y_test_pred # reference label predictions
reference['y_pred_proba'] = y_test_pred_proba # reference label probabilities
reference = reference.join(test['timestamp']) # timestamp column
analysis = X_prod.copy() # features
analysis['is_canceled'] = prod['is_canceled'] # ground truth (optional for estimation, kept for later comparison)
analysis['y_pred'] = y_prod_pred # analysis label predictions
analysis['y_pred_proba'] = y_prod_pred_proba # analysis label probabilities
analysis = analysis.join(prod['timestamp']) # timestamp column
Exporting data
The final step is to export our reference and analysis sets to CSV files.

reference.to_csv("reference.csv", index=False)
analysis.to_csv("analysis.csv", index=False)
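As a hedged preview of Part 2 (the exact argument names may differ across NannyML versions), these two files are meant to feed NannyML's CBPE performance estimator, roughly like this:

import nannyml as nml

# fit CBPE on the reference set, then estimate ROC AUC on the analysis set
estimator = nml.CBPE(
    y_pred_proba='y_pred_proba',
    y_pred='y_pred',
    y_true='is_canceled',
    timestamp_column_name='timestamp',
    problem_type='classification_binary',
    metrics=['roc_auc'],
)
estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)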
Conclusions
Not knowing if customers will cancel their bookings is a common problem for hotels. In this blog post, we explored a hotel booking dataset, uncovering insights into different customer behaviors and their effects on predictions. We also walked through the data preparation process for NannyML, starting with raw data and ending with reference and analysis sets.
In the next part, we'll evaluate how well our model works in practice without ground truth, using NannyML's performance estimation algorithm in SageMaker!