The goal of this tutorial is to perform churn prediction on public data from a US telecom. The tutorial uses a data set that contains the churn history of phone numbers, taken from the book “Discovering Knowledge in Data: An Introduction to Data Mining.”
Note: This is a tutorial overview and does not include steps or screen captures for a complete scenario.
Assume the data is already imported to Treasure Data as a table:
The table has 1,000,000 records (1 record = 1 phone number, that is, one customer profile). Each profile has 20 attributes (such as day calls, account length, and international plan) and 1 label column, “Churn”.
The goal is to create a predictive model that determines the customers who are likely to churn in the near future.
- Create a master segment based on the data. Because the data is quite simple, you can create a master segment by directly using the table as a master table:
- Click Run to generate the profile data.
In reality, you might need to preprocess your data to create a reasonable master segment with informative attributes.
Create batch segments which represent the churn behavior
Define batch segments representing a churn prediction. In this example, the goal is to assign predictive scores to customers who have not churned yet. See Predicting Customer Behavior. The separation between population, positive samples, and scoring target can be illustrated as follows:
Create segments. Separate segments mean that:
- A predictive model is built based on the full master segment, and the model represents characteristics of customers (profiles in the set) who are in the positive samples segment.
- In the scoring step, only active customers get a predictive score according to their likelihood of future churn.
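The separation between the three segments can be sketched in a few lines of Python. This is only an illustration of the idea, not how Treasure Data builds segments: the `profiles` records, the `phone` key, and the boolean `churn` key are hypothetical stand-ins for the master table.

```python
# Hypothetical stand-in for the master table: one record per phone number.
profiles = [
    {"phone": "p1", "churn": True},
    {"phone": "p2", "churn": False},
    {"phone": "p3", "churn": False},
    {"phone": "p4", "churn": True},
]

# Population: the full master segment, used to build the model.
population = profiles

# Positive samples: customers who have already churned.
positive_samples = [p for p in population if p["churn"]]

# Scoring target: active customers, who receive a predictive score.
scoring_target = [p for p in population if not p["churn"]]

print(len(population), len(positive_samples), len(scoring_target))
```

In this sketch the positive samples and the scoring target do not overlap, and together they cover the whole population.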
Configure predictive scoring
After the segments are defined, you are ready to implement predictive scoring.
Specify the dependent segments and attributes used for prediction in Predictive Scoring:
Choosing a subset of attributes is part of feature engineering in machine learning. Data scientists generally spend a significant amount of time finding an appropriate feature set.
To allow non-experts to choose reasonable attributes, you can use Treasure Data's feature guess function. Click Suggest Predictive Features to see a suggested set of attributes that you can use to make a reasonable prediction on your profile set and segments. See How Feature Guess Works.
Selected columns are categorized into the following types:
- Categorical Features
  - Attributes which are not meaningful as a numeric value
  - such as gender, day of week, group, …
- Categorical Array Features
  - Array column on TD which can be treated as single categorical information
  - such as td_affinity_categories generated by content affinity engine, list of games played before, …
- Quantitative Features
  - Numeric values
  - such as age, price, frequency, …
You can add and remove columns.
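As a rough illustration of the three column types, the sketch below sorts the attributes of one profile by the Python type of their values. It is not Treasure Data's feature guess function; `guess_feature_type`, the sample profile, and its values are assumptions made for this example.

```python
def guess_feature_type(value):
    """Classify one attribute value into one of the three feature types."""
    if isinstance(value, (list, tuple)):
        return "categorical_array"
    if isinstance(value, bool):
        # Booleans are flags, not meaningful as numeric values.
        return "categorical"
    if isinstance(value, (int, float)):
        return "quantitative"
    return "categorical"


# Hypothetical profile; the attribute values are invented for illustration.
profile = {
    "international_plan": "no",
    "td_affinity_categories": ["sports", "music"],
    "account_length": 128,
}

feature_types = {name: guess_feature_type(v) for name, v in profile.items()}
print(feature_types)
```

A type-based rule like this is only a starting point. A numeric code such as an area code is stored as a number but is better treated as a categorical feature, which is one reason to review the selection and add or remove columns by hand.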
Learning from the population, and assigning predictive scores to customers
Ultimately, what you need to do for predictive scoring is:
- Click Run on the Predictive Scoring view (~10min)
- Click Run on the Master Segment view (~7min)
Each run operation internally executes Treasure Workflow. You cannot edit these internal workflows.
- The first workflow learns the characteristics of customers (profiles) in the population set (master segment) and builds a predictive model.
- The second workflow re-generates the profile set and assigns a predictive score to profiles in the scoring target segment at the time when you click Run.
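The two steps can be pictured with a deliberately simple toy model. The sketch below is not the algorithm that Treasure Data runs internally: it learns the churn rate for each value of a single attribute from the population, then scores only the active customers. The `fit` and `score` functions, the `intl_plan` attribute, and the sample data are all hypothetical.

```python
from collections import defaultdict


def fit(population, positive_ids, attribute):
    """Step 1: learn the share of positive samples for each attribute value."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for profile in population:
        value = profile[attribute]
        totals[value] += 1
        if profile["id"] in positive_ids:
            positives[value] += 1
    return {value: positives[value] / totals[value] for value in totals}


def score(model, targets, attribute, default=0.0):
    """Step 2: assign a 0-100 score to each profile in the scoring target."""
    return {p["id"]: round(100 * model.get(p[attribute], default)) for p in targets}


# Hypothetical population: (id, international plan) pairs.
population = [
    {"id": i, "intl_plan": plan}
    for i, plan in enumerate(["yes", "yes", "yes", "no", "no", "no", "no", "no"], start=1)
]
positive_ids = {1, 2, 4}  # customers who already churned
scoring_target = [p for p in population if p["id"] not in positive_ids]

model = fit(population, positive_ids, "intl_plan")
scores = score(model, scoring_target, "intl_plan")
print(scores)
```

The model is learned from the whole population, but only profiles in the scoring target receive a score, which mirrors the split between the two Run operations.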
Review the data on the dashboard
After the master segment is successfully re-generated, review your dashboard:
The histogram shows the distribution of predictive scores. The horizontal axis shows the predictive score, from 0 to 100, and the vertical axis shows the number of scored profiles (customers). The colors distinguish the two groups into which customers are categorized:
If a customer is in the positive samples segment, the customer is in the Converted group.
Based on the thresholds, adjusted by a seek bar located under the histogram, each of the customer profiles is assigned to one of four grades:
Likely, Possibly, Marginally and Unlikely
For example, no active customers are categorized into the Likely grade, and 29 active customers are in the Possibly grade. If you want to reach more “likely” customers, move the right-most threshold on the seek bar to a smaller value so that the percentage in the Likely circle increases.
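The effect of moving a threshold can be sketched as follows. The threshold values (25, 50, 75), the sample scores, and the choice to treat each boundary as inclusive are assumptions for illustration; the actual defaults and boundary handling on the dashboard may differ.

```python
from collections import Counter


def grade(score, thresholds=(25, 50, 75)):
    """Map a 0-100 predictive score to one of the four grades."""
    marginally, possibly, likely = thresholds
    if score >= likely:
        return "Likely"
    if score >= possibly:
        return "Possibly"
    if score >= marginally:
        return "Marginally"
    return "Unlikely"


scores = [10, 30, 55, 60, 72, 80]  # hypothetical predictive scores

before = Counter(grade(s) for s in scores)
# Lower the right-most threshold from 75 to 70.
after = Counter(grade(s, thresholds=(25, 50, 70)) for s in scores)

print(before["Likely"], after["Likely"])
```

Lowering the right-most threshold moves the profile with a score of 72 from Possibly to Likely, so the Likely grade grows while the total number of scored profiles stays the same.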
Create a new batch segment based on the predictive scores
After thresholds are adjusted to desired positions, select Create New Segment:
This creates a new batch segment based on the predictive scores. For example, you might be interested in the Possibly and Likely customers.
You can also specify if the segment includes customers who are in the positive sample, population segment, or in both.
A new segment based on the predictive scoring is created as follows:
Because the Likely and Possibly grades have 0 and 29 customers, respectively, according to the dashboard, this batch segment contains 29 “promising” customers in total.
Now you can set up activation on the segment. For instance, you can send an email (such as special campaign information) to customers who might churn in the near future to prevent their churn.
Understand predictive model and tune it to achieve higher accuracy
Machine learning on real-world data is not simple, and it sometimes produces inaccurate or undesired predictions. You can use the auxiliary information at the bottom of the Predictive Scoring view to understand and improve your predictive scoring:
Statistics of your audience are shown with an estimated accuracy of prediction:
In addition, you can visually confirm which attributes contribute strongly to the prediction and what kinds of values exist in each attribute:
Reviewing this information, you can see that the number of customer service calls contributes positively to customer churn, and that having no international plan is associated with a lower churn rate.
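One simple way to sanity-check such a reading outside the dashboard is to compare the average value of an attribute between churned and active customers. The sketch below does this for a hypothetical `service_calls` attribute on invented sample data; it illustrates the idea only and does not reproduce how the Predictive Scoring view computes attribute contributions.

```python
from statistics import mean

# Hypothetical sample: number of customer service calls and the churn label.
profiles = [
    {"service_calls": 5, "churn": True},
    {"service_calls": 4, "churn": True},
    {"service_calls": 1, "churn": False},
    {"service_calls": 0, "churn": False},
    {"service_calls": 2, "churn": False},
]


def mean_by_label(rows, attribute, label="churn"):
    """Return the attribute's mean for churned and for active customers."""
    churned = [r[attribute] for r in rows if r[label]]
    active = [r[attribute] for r in rows if not r[label]]
    return mean(churned), mean(active)


churned_mean, active_mean = mean_by_label(profiles, "service_calls")
print(churned_mean, active_mean)
```

A clearly higher mean among churned customers is consistent with the attribute contributing positively to churn, although a difference in means alone does not establish a causal relationship.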