Goal: Churn prediction on public data from US telecom
This tutorial uses a “churn” data set, which consists of the churn history of phone numbers and is used in the book “Discovering Knowledge in Data: An Introduction to Data Mining.” Assume the data is already imported to a table as:
The table has 3,333 records (1 record = 1 phone number, treated here as 1 customer). Each customer has 20 attributes (such as day calls, account length, and international plan) and 1 label column “Churn” (True. or False.). Note that, in reality, the number of records should be sufficiently large; more than 10,000 customers is preferable.
The goal is to create a predictive model that finds “customers who are likely to churn in the near future.”
From audience creation to machine learning-based syndication
Step 1. Create an audience
First, create an audience based on the data. Because the data is quite simple, you can create an audience by directly using the table as a master table:
When you click Run, the customer data is generated.
In practice, real data is often messier, so you might need to preprocess your data to create a reasonable audience with informative attributes.
Step 2. Create batch segments which represent your “goal”
Next, as described in the overview, define batch segments that represent your goal, which here is churn prediction. In this example, the goal is to assign predictive scores to customers who have not churned yet. The separation between population, positive samples, and scoring target can be illustrated as follows:
In the segmentation builder, create the following segments:
- Positive samples
- Scoring target
This separation means:
- A predictive model is built from the full customer set (the population), and the model captures the characteristics of customers who are in the positive samples segment.
- In the scoring step, only active customers get a predictive score that reflects their likelihood of future churn.
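The separation above can be sketched in a few lines of Python. This is only an illustration, not how the segmentation builder works internally: the record layout, the `id` field, and the assumption that the label is stored as the strings `True.` and `False.` are all hypothetical.

```python
# Toy records; the IDs and label values are made up for illustration.
customers = [
    {"id": "c1", "churn": "False."},
    {"id": "c2", "churn": "True."},
    {"id": "c3", "churn": "False."},
    {"id": "c4", "churn": "True."},
    {"id": "c5", "churn": "False."},
]


def split_segments(records):
    """Return (population, positive_samples, scoring_target)."""
    # Population: every customer is used to build the model.
    population = list(records)
    # Positive samples: customers who have already churned.
    positive_samples = [r for r in records if r["churn"] == "True."]
    # Scoring target: active customers who have not churned yet.
    scoring_target = [r for r in records if r["churn"] != "True."]
    return population, positive_samples, scoring_target


population, positive_samples, scoring_target = split_segments(customers)
print(len(population), len(positive_samples), len(scoring_target))
```

Note that the positive samples and the scoring target are both subsets of the population, and in this sketch they do not overlap.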
Step 3. Configure predictive scoring
When the segments are properly defined, you are ready to implement predictive scoring. Specify the dependent segments and attributes used for prediction on the “Predictive Scoring” view:
Choosing a subset of attributes is a part of feature engineering in the context of machine learning, and data scientists generally spend a significant amount of time finding an appropriate feature set.
To let non-experts choose reasonable attributes, you can use the feature guess function instead. When you click Guess Columns in the configuration view, you see a suggested set of attributes (that is, columns) that you can use to make a reasonable prediction on your audience and segments. How Feature Guess Works describes the details of this functionality.
Selected columns are categorized into the following three types:
- Categorical Columns
  - Attributes which are not meaningful as a numeric value
  - such as gender, day of week, group, …
- Categorical Array Columns
  - Array column on TD which can be treated as single categorical information
  - such as td_affinity_categories generated by content affinity engine, list of games played before, …
- Quantitative Columns
  - Numeric values
  - such as age, price, frequency, …
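To make the three column types concrete, the following sketch shows one common way such columns could be turned into model features. It is an assumption for illustration, not the encoding that predictive scoring actually uses; the `encode` helper, the `column#value` naming, and the sample record are hypothetical.

```python
def encode(record, categorical, categorical_array, quantitative):
    """Turn one record into a sparse {feature_name: value} mapping."""
    features = {}
    # Categorical: each distinct value becomes its own indicator feature.
    for col in categorical:
        features[f"{col}#{record[col]}"] = 1.0
    # Categorical array: every element of the array becomes an indicator.
    for col in categorical_array:
        for item in record[col]:
            features[f"{col}#{item}"] = 1.0
    # Quantitative: the numeric value is used as-is.
    for col in quantitative:
        features[col] = float(record[col])
    return features


# Hypothetical record mixing the three column types.
record = {
    "international_plan": "no",
    "td_affinity_categories": ["sports", "music"],
    "day_calls": 110,
}
features = encode(
    record,
    categorical=["international_plan"],
    categorical_array=["td_affinity_categories"],
    quantitative=["day_calls"],
)
print(features)
```

The point of the distinction is that a categorical value such as `no` carries no numeric meaning, so it is represented by presence or absence rather than by magnitude.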
If you have adequate knowledge of your data, you can manually add and remove columns.
Step 4. Learning from the population, and assigning predictive scores to customers
Ultimately, what you need to do for predictive scoring is two-fold:
- Click Run on the Predictive Scoring view (~10min)
- Click Run on the Audience view (~7min)
Each run operation internally executes Treasure Workflow. The first workflow learns the characteristics of customers in the population set and builds a predictive model. The second workflow re-generates the audience and assigns a predictive score to each customer who is in the scoring target segment at the time you click Run. As described in the Audience Suite overview, you cannot edit these internal workflows.
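The two runs can be pictured as "train on the population, then score the target." The sketch below uses a tiny logistic regression written in plain Python as a stand-in; the actual algorithm inside the internal workflows is not described in this tutorial, and the toy features (customer service calls, international plan flag), the learning rate, and the epoch count are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, weights, bias):
    """Estimated probability that a customer with features x churns."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train(rows, labels, epochs=200, lr=0.1):
    """Fit a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = y - predict(x, weights, bias)
            bias += lr * error
            weights = [w + lr * error * v for w, v in zip(weights, x)]
    return weights, bias


# Toy population: ((customer_service_calls, has_international_plan), churned)
population = [
    ((1, 0), 0), ((0, 0), 0), ((5, 1), 1),
    ((4, 0), 1), ((2, 0), 0), ((6, 1), 1),
]

# First run: learn from the whole population.
rows = [x for x, _ in population]
labels = [y for _, y in population]
weights, bias = train(rows, labels)

# Second run: score only customers who have not churned, on a 0-100 scale.
scoring_target = [x for x, churned in population if not churned]
scores = {x: round(100 * predict(x, weights, bias)) for x in scoring_target}
print(scores)
```

Separating the two functions mirrors the two Run buttons: the model can be rebuilt without rescoring, and the audience can be rescored whenever the scoring target changes.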
Step 5. Review the data on the dashboard
After the audience is successfully re-generated, the dashboard tab is enabled on the Predictive Scoring view:
The histogram shows the distribution of predictive scores: the horizontal axis is the predictive score, from 0 to 100, and the vertical axis is the number of scored customers. The colors distinguish two groups of customers, “Unknown/Unconverted” and “Converted”: a customer in the positive samples segment belongs to the “Converted” group, and every other customer belongs to the “Unknown/Unconverted” group.
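As a rough picture of what the histogram counts, the following sketch bins scores per group. The bin width of 10 and the sample scores are assumptions chosen for illustration; the dashboard's real binning is not documented here.

```python
from collections import Counter


def histogram(scored, bin_width=10):
    """Count customers per score bin, separately for each group.

    `scored` is a list of (score, in_positive_samples) pairs.
    """
    counts = {"Converted": Counter(), "Unknown/Unconverted": Counter()}
    for score, in_positive_samples in scored:
        group = "Converted" if in_positive_samples else "Unknown/Unconverted"
        # Put a score of exactly 100 into the last bin instead of a new one.
        bin_start = min(score // bin_width * bin_width, 100 - bin_width)
        counts[group][bin_start] += 1
    return counts


# Made-up scores for six customers.
scored = [(5, False), (12, False), (18, False), (47, False), (88, True), (100, True)]
counts = histogram(scored)
for group, bins in counts.items():
    print(group, dict(sorted(bins.items())))
```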
Additionally, based on the thresholds set with the seek bar under the histogram, each customer is assigned to one of four “likely” grades (Likely, Possibly, Marginally, and Unlikely):
In this example, no active customers are categorized into the Likely grade, and 29 active customers are in the Possibly grade. If you want to reach more “likely” customers, move the right-most threshold on the seek bar to a smaller value so that the percentage in the Likely circle increases to the desired value.
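The effect of moving the right-most threshold can be sketched as follows. The threshold values, the inclusive lower bounds, and the sample scores are assumptions; only the four grade names come from the dashboard.

```python
def grade(score, thresholds=(25, 50, 75)):
    """Map a 0-100 score to a grade using three ascending thresholds."""
    low, mid, high = thresholds
    if score >= high:
        return "Likely"
    if score >= mid:
        return "Possibly"
    if score >= low:
        return "Marginally"
    return "Unlikely"


def count_grades(scores, thresholds):
    counts = {"Likely": 0, "Possibly": 0, "Marginally": 0, "Unlikely": 0}
    for s in scores:
        counts[grade(s, thresholds)] += 1
    return counts


scores = [10, 30, 55, 60, 72, 80]  # made-up scores
before = count_grades(scores, (25, 50, 75))
# Lowering the right-most threshold moves customers from Possibly to Likely.
after = count_grades(scores, (25, 50, 60))
print(before)
print(after)
```

With these toy scores, lowering the right-most threshold from 75 to 60 raises the Likely count from 1 to 3 and reduces the Possibly count by the same amount.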
Step 6. Create new batch segment based on the predictive scores
After thresholds are adjusted to desired positions, click Create New Segment located at the right corner:
In the pop-up window, you create a new batch segment based on the predictive scores. The preceding example assumes that you are interested in Possibly and Likely customers. In addition, you can specify if the segment includes customers who are in the positive sample or population segment, or in both.
A new segment based on the predictive scoring is created as follows:
Because the Likely and Possibly grades have 0 and 29 customers, respectively, according to the dashboard, this batch segment contains 29 “promising” customers in total.
Now, you can set up syndication on the segment. For instance, you can send an email (such as special campaign information) to the customers who might churn in the near future to prevent their churn.
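Conceptually, the new segment is a filter on grade plus a choice about whether to keep customers who are already in the positive samples. A minimal sketch, with hypothetical field names and made-up customers:

```python
def build_segment(customers, grades, include_positive_samples=False):
    """Return the IDs of customers whose grade is in `grades`."""
    selected = []
    for c in customers:
        if c["grade"] not in grades:
            continue
        # Customers who already churned are skipped unless explicitly included.
        if c["in_positive_samples"] and not include_positive_samples:
            continue
        selected.append(c["id"])
    return selected


customers = [
    {"id": "c1", "grade": "Possibly", "in_positive_samples": False},
    {"id": "c2", "grade": "Unlikely", "in_positive_samples": False},
    {"id": "c3", "grade": "Likely", "in_positive_samples": True},
    {"id": "c4", "grade": "Possibly", "in_positive_samples": False},
]
segment = build_segment(customers, {"Possibly", "Likely"})
print(segment)
```

For a churn-prevention campaign, excluding the positive samples is the natural default, because those customers have already churned.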
Understand the predictive model and tune it to achieve higher accuracy
Machine learning on real-world data is not simple, and predictions are sometimes inaccurate or otherwise undesirable. To understand and improve your predictive scoring, you can use the auxiliary information provided at the bottom of the Predictive Scoring view:
Statistics of your audience are shown with an estimated accuracy of prediction:
In addition, you can visually confirm which attributes strongly contribute to the prediction and what kind of values exist in each attribute:
From the information shown in this tutorial, you can see that the number of customer service calls contributes positively to customer churn, and that having no international plan leads to a lower churn rate.
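One simple way to sanity-check such an observation on your own data is to compare the churn rate per attribute value. The sketch below is a plain per-value rate, not the contribution measure shown on the dashboard, and the records are made up so that the direction matches the observation above.

```python
from collections import defaultdict


def churn_rate_by_value(records, attribute):
    """Return {attribute_value: fraction of customers who churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for r in records:
        totals[r[attribute]] += 1
        if r["churn"]:
            churned[r[attribute]] += 1
    return {value: churned[value] / totals[value] for value in totals}


# Toy records, not taken from the real data set.
records = [
    {"international_plan": "no", "churn": False},
    {"international_plan": "no", "churn": False},
    {"international_plan": "no", "churn": False},
    {"international_plan": "no", "churn": True},
    {"international_plan": "yes", "churn": True},
    {"international_plan": "yes", "churn": False},
]
rates = churn_rate_by_value(records, "international_plan")
print(rates)
```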
See How to Tune Predictive Scoring for further guidance.