In recent years, AWS customers have been running machine learning (ML) on an increasing variety of datasets and data sources. Because a large percentage of organizational data is stored in relational databases such as Amazon Aurora, there’s a common need to make this relational data available for training ML models, and to use ML models to make predictions in database-based applications. This post shows how to easily extract your production data from Aurora, train an ML model in Amazon SageMaker, and integrate the model inferences back into your production database and applications. It extends a popular ML use case, predicting customer churn, and demonstrates how to achieve the real business goal of preventing customer churn. We’ll use a large phone company as our setting.
At our telco company, our CEO called us all into a meeting. “We have around 15% of customers leaving our service, or “churning”, every year! Losing customers and then acquiring new customers is expensive. It’s a major drag on our annual results! Can we predict which customers are likely to churn, and give them incentives to stay? Can we use machine learning (ML) to help?”
Here’s our ensuing discussion, condensed for brevity.
“Well,” said our ML engineer. “I understand that all of our customer data is stored in an Amazon Aurora relational database. If the DBA can get me the data, I can build an ML model to predict which customers will churn. I can use an Amazon SageMaker XGBoost Built In Model for that – it’s an algorithm commonly used for regression, classification, and ranking problems. And with SageMaker automatic model tuning I should get a pretty good model.”
The DBA said, “Sure! I can give you a dump of part of our production database. AWS offers S3 download from Amazon Aurora, so that makes it easy.”
The CEO roared, “I don’t want to predict who is leaving! I want to prevent them!”
“If you can tell us what factors are most tied to churn, we could build some targeted incentive programs,” said Marketing.
“Hmmm,” said our ML engineer. “Once I have the model, I can tell you the most important factors, or features, as we call them, used by the model. We call that ‘explainability’.“
Customer Service chimed in, “We can also look at our customer service chats for clues about each customer’s sentiment. However, the number of chats is too big for us to read manually. If an ML-based system can tell – in the first sentence of the chat, say – that a customer is not happy or they’re likely to churn, then we could offer them one of those incentives on the spot!”
“I can use Amazon Comprehend to tell you the customer’s sentiment. That’s easy,” said the ML engineer.
The DBA chimed in, “Then with Amazon Aurora Machine Learning, we can just write a SQL query that returns the customer sentiment and churn prediction along with the basic customer information at the time you start the chat. That way you’ll have all the customer details at your fingertips. We can then use the programs that Marketing gives us to offer an incentive to the likeliest churners.”
“Great!” said the CEO. “Make it so!”
In the rest of this blog, we’ll walk you through the steps required to do so.
Solution overview
In most organizations, a database admin (DBA) or an application developers works with a database, and the ML engineer works with the ML model. However, one person can perform all the steps to implement the solution we describe here. Because the solution is implemented using the Amazon SageMaker built-in XGBoost algorithm, more in-depth data scientist skills are not needed in this case.
The following diagram illustrates the AWS services used to implement the solution.
You get access to ML inferences from your production database with the following steps:
- The data of interest is originally stored in an Aurora MySQL (A similar process can be used for the PostgreSQL-compatible edition of Aurora.) The DBA uses a SQL
SELECT INTO OUTFILE S3
statement to unload data from Aurora to an S3 bucket. - The ML engineer uses a Jupyter notebook hosted on an Amazon SageMaker notebook instance to read and explore the data from Amazon S3. AWS Secrets Manager is used to securely hold the Aurora access credentials.
- From the Jupyter notebook, the ML engineer partitions the data into separate sets of data for use in training and testing the ML model. The ML engineer writes the training data back to Amazon S3 to use during model training. (You could also perform partitioning during the initial data unload from Aurora.)
- From the same notebook, the ML engineer starts a training job using the Amazon SageMaker built-in XGBoost algorithm.
- Amazon SageMaker reads the training data from Amazon S3.
- Amazon SageMaker trains the model.
- The ML engineer checks the model characteristics. When satisfied, she creates an Amazon SageMaker endpoint to serve predictions from the trained model and sets up the connection between the database and the endpoint.
- The application (or the DBA) executes a SQL query in Aurora.
- Aurora passes on the request for the churn prediction to the Amazon SageMaker endpoint.
- Aurora passes on the request for the sentiment prediction on text to Amazon Comprehend.
- Aurora assembles the results and returns them with data from its own tables, as a SQL query result.
This post uses a dataset from the AWS Samples Customer Churn sample notebook. It’s publicly available and is mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It’s attributed by the author to the University of California Irvine Repository of Machine Learning Datasets.
What makes our life much easier is that Aurora has a feature called Aurora machine learning, which was launched at re:Invent 2019. This feature is a pre-built integration with Amazon SageMaker and Amazon Comprehend that calls these services within SQL queries and takes care of transferring the data efficiently. For instructions on configuring Aurora ML capabilities, see Enabling Aurora Machine Learning. This post creates and configures these components for you via an AWS CloudFormation template. You can also find a copy of all the code in AWS Samples: Gain Customer Insights.
Setting up the solution infrastructure
To set up the infrastructure components and configure the connections, launch a predefined CloudFormation stack. The stack sets up the following resources:
- An Amazon VPC in which to run the components, and two private subnets and one public subnet with a NAT gateway.
- An Amazon Aurora cluster containing a single database instance, and a security group that is configured to give access to the Amazon SageMaker notebook instance. The cluster resides in the private subnet. This database instance isn’t set up as a production configuration. For more information, see Category: Amazon Aurora and AWS Quick Starts for Aurora. The stack also creates a DB cluster parameter group with appropriate parameter settings, specifically the following:
aws_default_comprehend_role
is set to an IAM role that gives Aurora permission to call the Amazon ComprehendDetectSentiment
andBatchDetectSentiment
APIs.aws_default_sagemaker_role
is set to an IAM role that gives Aurora permission to invoke the Amazon SageMaker endpoint that we’ll use for our model.aws_default_s3_role
is set to an IAM role that permits access to the S3 bucket we’ll use for our data.
- An Amazon SageMaker notebook instance with the associated components:
- An IAM role for use by the notebook instance. The IAM role has the managed role
AmazonSageMakerFullAccess
, which includes access to S3 buckets containing SageMaker in the name. This role may have more permissions than you wish to use in production. - A security group used for the notebook instance.
- An Amazon SageMaker lifecycle configuration that copies in three predefined Jupyter notebooks with the sample code. It also creates a file called
cloudformation_values.py
, which contains the values of key resources created by the stack, such as the S3 bucket name and the secret name that contains the Aurora hostname and the password. This file is used to pass these values to the Jupyter notebooks.
- An IAM role for use by the notebook instance. The IAM role has the managed role
- An Amazon SageMaker endpoint. The endpoint is listed in the IAM role specified in the DB cluster parameter group parameter
aws_default_sagemaker_role
. - An S3 bucket with
SageMaker
in the name so the Amazon SageMaker default IAM role can use it. - A Secrets Manager secret, with the database username, password, and hostname. Using the secret means that the username and password aren’t stored in the notebook.
To see this solution in operation in the US East (N. Virginia) Region, launch the stack from the following link. The total solution costs approximately $1.00 as a one-time charge plus $1.15 per hour to run. Remember to delete the AWS CloudFormation stack when you’ve finished with the solution to avoid additional charges.
To run the stack, complete the following steps:
- Choose Launch Stack and choose Next.
- Update the following parameters for your environment, or leave the defaults:
- DatabaseAdministratorUsername
- SageMakerInstanceType
- RDSInstanceType
- Choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
- Choose Create.
- Wait for the CloudFormation stack to reach a status of
CREATE_COMPLETE
(up to 15–20 minutes). - Check the Outputs tab for the resources created. See the following screenshot.
- On the Amazon SageMaker console, choose Notebook instances.
- Locate the notebook that starts with
PreventChurnSMNotebook-
in the list of notebook instances. - Choose Open JupyterLab.
You should see a list of files copied into the notebook for you, similar to the following screenshot.
The following sections walk you through each notebook.
Setting up and unloading the data
Setting up the Amazon Aurora database and saving the data to Amazon S3 for use in ML are steps that the DBA would normally perform, often using a SQL client. For consistency, this post demonstrates the steps in a Jupyter notebook using mysql.connector
Python module.
The first notebook, part_1_preventing_customer_churn_Amazon_Aurora_setup.ipynb
, is divided into two sections. For a pre-executed HTML version, see Preventing Customer Churn, Part 1. Connect to Amazon Aurora MySQL database, data loading and extraction.
In this section, we fetch some sample customer data. Each record is labeled if this customer churned or not. We write this data out to Amazon S3. We create a table in a database to hold the data, then use a LOAD DATA FROM S3 statement to bulk load the data into an Aurora database table. We also store some historical customer messages into a second table.
Now we’re ready to walk through the use case.
In Section 2, we begin: The DBA has just received the request: “Please export the customer data to S3, so the data scientist can explore the reason for data churn. Thanks!”
Luckily, there’s an Amazon Aurora feature that makes it easy: Saving Data from an Amazon Aurora MySQL DB Cluster into Text Files in an Amazon S3 Bucket. A single SQL statement, and the DBA has done her part! Customer data, including summaries of their telephone plans, usage, and customer calls, are all now in S3. We’re ready to move on to the second notebook.
Running the ML algorithm and reviewing model results
The second notebook walks you through training an Amazon SageMaker XGBoost model on the customer data and using hyperparameter optimization (HPO) to find the best model. It also explores the features used in the best-performing model to understand the most impactful features. This is the information Marketing needs in order to put together an incentive program.
For the data exploration and training code, see the notebook part_2_preventing_customer_churn_XGBoost.ipynb
. (If you see the message select a kernel
when starting this notebook, choose conda_python3
.) For a pre-executed HTML version, see Preventing Customer Churn, Part 2. Building the ML Model.
In this notebook, we build two XGBoost models. The first model uses all the features available in the customer data. After dropping several highly-correlated columns and converting categorical variables (such as state
) to one-hot vectors, the training data has around 70 columns.
This post presents a sample run. In this run we saw a training AUC of 0.949, and a validation AUC of 0.884 for the first model. In all cases, due to randomization inherent in XGBoost, the details vary slightly from run to run. This is most likely when several features have similar predictive values and can be substituted for each other with similar results, which leads to similar model performance.
We then load the XGBoost trained model into our notebook to analyze the features that it used to achieve these results. First, we’ll use the built-in feature importance plots from XGBoost. The following graph shows gain, which is the improvement in accuracy brought by a feature to the branches it’s on.
The following graph shows cover, which measures the relative quantity of observations concerned by a feature.
A few features, such as cust_service_calls
, day_mins
, and int_plan_no
(a binary feature; that the customer has no international plan), are amongst the key drivers of customer churn. Feature importance tells you that this feature is influential but doesn’t tell you the direction (does it increase or decrease churn). The direction can often be inferred through business sense, or by evaluating the trees generated by the model, as shown below.
To address that question, this post presents a few of the trees generated. It’s not generally feasible to explore all trees, but it can be helpful to verify that the data the model uses and the model itself are both reasonable. This can be invaluable in identifying data extraction errors, spurious patterns, or the need for feature engineering.
The following diagram shows the first tree that XGBoost generated.
Interesting! The first split is on customer_service_calls
of 3.5. In this image, we see the potential to use these trees to suggest or understand different kinds of churners. We can easily plot additional trees, or pull a list of the key splits, such as the top few splits of the various trees. This information may provide Marketing with some ideas for different churn profiles amongst customer segments, which can lead to different targeted incentive programs.
A little data exploration supports the idea that four or more calls shift the percentage of churners (see the following screenshot). We’ll likely want to intervene before they make that fourth phone call.
In some cases of using the XGBoost algorithm—primarily where the data has little information, or where the model isn’t converging well—some or even many or most of the trees may contain only a single leaf, and so don’t add any decision power to the model. In others, we’ve seen that we need to expand the hyperparameter bounds we’re using due to the inherent complexity in the data. To see if that’s affecting this particular model, we calculate the depth of each tree, as in the following bar plot.
In this test run, several trees contain only a leaf, so it appears that further optimization is possible.
We also want to get a sense of which features influence the model by seeing how the features are used across the trees. The following graph shows the model features and the number of each that has been used to split a tree during training.
In the preceding graph, the majority of features aren’t even used in the tree splits. The following table shows the features that are used at least one time in the run and the number of times they’re used.
Feature | Number of Times Used |
day_mins |
8 |
cust_service_calls |
7 |
eve_mins |
7 |
int_plan_no |
6 |
int_charge |
6 |
vmail_msg |
6 |
night_mins |
5 |
int_calls |
4 |
day_calls |
1 |
Here, we see an opportunity! Whereas the training data had around 70 features, this model only uses 9 features. Due to the randomized nature of XGBoost, the exact number of times used and the rarely used features (small number of splits) change from run to run. A later section explores simplifying the model and thereby simplifying the calls from the production database, passing much less data to the model.
Evaluating the models
Next, we’ll evaluate our model against our test data. To test the model against the held-back test data, we deploy our model to an Amazon SageMaker endpoint created during the launch of the CloudFormation stack. This endpoint represents the inference functionality of Amazon SageMaker. For simplicity, this post uses a predefined endpoint_name
. The CloudFormation template setup specified this endpoint_name
in an IAM role and set the Aurora DB cluster group parameter aws_default_sagemaker_role
to this IAM role. This combination of settings gives Aurora permission to call the Amazon SageMaker endpoint that we’ll be creating here. We can then look at statistics such as the number of correct predictions and the accuracy of the model.
Now, we build an updated model using the same approach, but with only the features used in our first model. We evaluate it against the full model using standard ML measures such as accuracy, recall, and precision, as in the following table.
We can also assess the ROC curve for both models, as in the following graph.
The updated model provides results close to the original model; on some runs it performs slightly better, on others, slightly worse.
Because the difference between the two models is below our margin of error, and the updated model requires far less data to be passed in the call (9 rather than 70 features), we will use the updated model.
Assessing business impact
We can also assess the model performance by looking at the prediction scores and refining the threshold used to decide if someone is a churner. It’s common to treat this as a binary classification, but the real world is less binary. People are considered likely to churn for some time before they actually churn. Loss of brand loyalty occurs some time before someone buys from a competitor.
The following graph shows that the continuous valued predictions coming from the model tend to skew toward 0 or 1, but there is sufficient mass between 0.1 and 0.9 that adjusting the cutoff should indeed shift many customers’ predictions. For now, we’ll take a simple approach of optimizing the threshold assuming a single incentive, to get a starting point for future analysis.
At this point, Marketing has not yet told us the incentives they’d like to use, so let’s assume a retention incentive of $50, for customers that our model identifies, and a cost of $500 for each false negative. We can use these numbers in a cost optimization model to identify the optimal threshold.
The following graph shows how setting a churn threshold too low (below 0.1) results in costs skyrocketing as all customers are given a retention incentive. Additionally, setting the threshold too high (0.7 or above) results in too many lost customers, which ultimately grows to be nearly as costly. In between, there is a large gap where perhaps some more nuanced incentives would create better outcomes.
Using these incentive numbers, we can minimize the overall cost at $5950 by setting the cutoff to 0.12, which is substantially better than the $20,000 or more the business can expect to lose by not taking any action. These numbers can be used by Marketing as a starting point for their incentive planning exercise.
Developing incentives
The ML engineer now returns to Marketing with the results of her analysis.
The feature importance charts showed that a small number of features make up most of the churn prediction: the number of customer service calls, use of an international plan, and number of minutes of calls during the day.
The feature importance charts show which features in the XGBoost model contribute most to the model performance, but they don’t specify which direction or at what values these variables cause a change in outcome. Inspecting the trees can give us some insight. For the tree shown earlier and the associated churn probabilities, making a fourth customer service call seems to signal a major shift in the odds of someone becoming a churner. Another segmentation seems to be at around 3 hours of calls a day.
Based on these insights, Marketing can develop a set of incentives to offer customers. They can choose several methods of rolling out incentives, such as targeting specific customers via mailings or other outreach methods. Here, we focus on intervention at the time of the customer’s next call to customer service. Key to using these incentives is the ability for Customer Service to assess each customer call and identify and make a relevant offer if appropriate. The next section shows how to implement this capability.
For this use case, we begin with some simple heuristics, based on the customer’s sentiment during the call and their use of an international plan.
Using ML capabilities in production
With the ML model trained and deployed, and advice from Marketing on the incentive programs to use, we’re now ready to put the pieces together.
The third notebook, part_3_preventing_customer_churn_inferences_from_Amazon_Aurora.ipynb
, shows how to connect to and get the results of the ML capabilities—Amazon SageMaker and Amazon Comprehend—as part of a SQL query to Aurora. With both of these pieces of information in hand, we can make an on-the-fly decision about what incentive program to offer the customer. For a pre-executed HTML version, see Preventing Customer Churn, Part 3. Inference from Amazon Aurora.
We can perform all this work on Aurora, and will likely incorporate these queries on a backend of an internal application so customer service can in real time understand the risk of a customer churning based on a message from that customer and that customer’s details.
First, we’ll call Amazon Comprehend from Aurora. That way, we can also get an assessment of the customer’s sentiment in their message to customer service. The CloudFormation template already gave Aurora permission to call Amazon Comprehend by adding an IAM role to the DB cluster parameter group parameter aws_default_comprehend_role
. Now, calling Amazon Comprehend is as simple as issuing a SQL query, as shown below.
The query response lists the original message and two outputs from Amazon Comprehend: the sentiment (POSITIVE
), and the confidence level (in the preceding code, very high).
Now we’re ready to connect our Amazon SageMaker model. To do so, we create a function to call the Amazon SageMaker endpoint. For instructions, see Enabling Aurora Machine Learning. In Aurora, we create the SQL function will_churn
that uses the variables the model needs. We now include the columns created by one-hot encoding in the previous section.
Now we need to set up the Aurora database to call the Amazon SageMaker endpoint and pass the data it needs to return a prediction. We do so by creating a SQL function that collects the data, performs any transformations we need (for example, combining data or changing data types), and calls the model endpoint.
Our original data contained numeric variables and several categorical variables (such as area_code
and int_plan
). During the creation of the ML model, the categorical variables were converted to one-hot vectors. This conversion can lead to an explosion in the number of columns the model uses; for this use case, the original 15 feature columns became 70 after conversion to one-hot. Because only a short list of the columns was used, we can build a model using only those columns. This simplifies the database call to the ML model because it reduces the number of fields the SQL needs to include in the call.
In the final model, we used only one of the categorical values: int_plan_no
. However, the database column is int_plan
, and must be converted to its one-hot representation. There are two ways to approach this problem: add transformation code to the ML endpoint or create functions in the SQL database that represent one-hot encoded variables.
For simplicity, this post demonstrates the second option. We create a SQL function, IntPlanNo(int_plan)
, that returns the one-hot encoded value for this field. For more complex transformations, the first option is likely a better option.
Ready, set, go!
Now, finally, we can put all the pieces together!
We’ve taken the incentive programs defined by Marketing and, for demonstration purposes, created a Python function, assess_and_recommend_incentive
. The function takes the customer phone number and message the customer sent to customer service (via a chat, for example). It queries the database for customer details needed by Customer Service, and returns a likelihood of customer churn and a suggested incentive. The function contains several fixed incentives for customers with certain characteristics. We can also test for alternatives with a randomized incentive. The output contains the recommended incentive and a short message explaining the reason that incentive was chosen. For a production setting we’ll make these same calls from a business application.
The following code shows the results of two calls to the function. The first, with a negative customer message and a customer likely to churn, results in a credit recommendation. The second, with a neutral customer message and a less-likely-to-churn customer, results in no incentive offered.
Now we can give customer service the tool they need. With the customer’s phone number and text of the customer’s concern, we can immediately identify an incentive to offer based on their history in the database and their current interaction.
We can also test the effect of different values for the incentive program against the true and false positives and negatives returned by the model. This allows us to estimate the economic impact of our planned incentives.
Cleaning up
To avoid additional charges, you should delete the CloudFormation stack after walking through the solution.
You should also delete the S3 bucket and Amazon SageMaker endpoints created during model training. A cell at the end of the Part 2 notebook has code that deletes these for you.
Conclusion
Now we have the ability to detect, during a customer interaction, whether this is an at-risk customer, and to intervene by providing them with an incentive to stay. We can choose the factors on which we want to base our response: the sentiment of this interaction, combined with the customer’s current characteristics, interaction history, and estimated risk. We can come back to our CEO with a proposed set of incentives, and an estimate for the cost of our incentive program; or even with a set of alternatives, showing separate optimization curves for each. We’ve gained valuable insights into our customers’ behavior, and we have plans to gather more data for more insights.
To gather more data on effective incentives and customer behavior, we can also randomize our responses. We have the beginnings of understanding the key factors, and the beginnings of an experimentation platform. We can use that data in a future ML model to further refine the incentives offered.
We can now begin to add experimentation and nuance to our incentive program, such as the following:
- Record the customer’s sentiment when we offer an incentive, and also their churn prediction at this time, the current value of key drivers, and the incentive offered.
- Randomize the incentives offered to perform A/B testing of different incentives to customers with similar characteristics.
- Experiment with different kinds of incentives beyond a simple dollar rebate, such as new telephones, plan upgrades, and more.
- Store the incentive offered and later analyze their use and the results achieved. How much of an incentive must we offer to which kinds of customers to reduce churn? How much is retaining that customer worth to us, versus the cost of the incentive? What is the customer’s sentiment on hearing the incentive?
- Add economic analysis. How much is it worth to keep this customer? What’s the optimal threshold score to maximize profit?
As we explore these alternatives, we move from prediction to providing tangible and actionable business value. Over to you, Marketing!
About the authors
Dr. Veronika Megler is Principal Consultant, Data Science, Big Data & Analytics, for AWS Professional Services. She holds a PhD in Computer Science, with a focus on scientific data search. She specializes in technology adoption, helping companies use new technologies to solve new problems and to solve old problems more efficiently and effectively.
Vitalina Komashko is a Data Scientist for AWS Professional Services. She holds a PhD in Pharmacology and Toxicology, but don’t hold it against her. She specializes in reproducible research, clean code and helping biotech and pharma to formulate and solve problems with scalable solutions.