By combining your knowledge of SQL, TD Console, and a few Hivemall functions, you can run advanced machine learning. Hivemall provides machine learning functionality through SQL queries, so you can build models for tasks such as sales forecasting and customer scoring for marketing without programming against machine learning libraries. Before reading this article, we highly recommend that you read the Treasure Data Machine Learning Overview.
In this tutorial, you will learn how to build a regression model that predicts continuous values. You can build predictive models on TD Console by issuing Hive and Presto queries.
If you use train_regressor(), you can solve regression problems such as sales prediction, sensor data prediction, or production volume prediction.
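For reference, a train_regressor() call has the following general shape. This is only a sketch with placeholder table and column names (target_value and training_table are illustrative); the full query used in this tutorial appears later.
-- Sketch only: train a linear regressor from a feature vector and a numeric target
select
  train_regressor(
    features,                 -- array<string> feature vector
    target_value,             -- numeric value to predict (placeholder column name)
    '-loss_function squared'  -- hyper-parameters
  ) as (feature, weight)
from
  training_table              -- placeholder table name
;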
In this tutorial, you will learn:
- How to build a linear regression model to predict the number of purchases for the retail industry.
- How to evaluate the accuracy of predictions using Root Mean Square Error (RMSE) or Mean Absolute Error (MAE).
Other information sources:
- Machine Learning tutorial for SQL users
- Apache Hivemall User Guide
- Treasure Data Machine Learning articles
Creating Customer data
- Open TD Console.
- Create a database for this tutorial. Name it using only lower case letters, numbers, and _. For example, my_ml_db.
- Select Create.
The database is created and TD Console displays the page you can use to add tables.
- Navigate to Data Workbench > Queries.
- Select New Query.
- Select the database that you just created in the left side drop down.
- Make sure that Presto is selected as the query engine.
- Paste the following code into the editor.
create table if not exists customers as
select 1 as id, 'male' as gender, 23 as age, 'Japan' as country, 12 as num_purchases
union all
select 2 as id, 'female' as gender, 43 as age, 'US' as country, 4 as num_purchases
union all
select 3 as id, 'other' as gender, 19 as age, 'UK' as country, 2 as num_purchases
union all
select 4 as id, 'male' as gender, 31 as age, 'US' as country, 20 as num_purchases
union all
select 5 as id, 'female' as gender, 37 as age, 'Australia' as country, 9 as num_purchases
;
- Name and save the query. For example, create_customers.
- Run the query.
- Navigate to Data Workbench > Databases > my_ml_db > customers to validate that the data is similar to the following:
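You can also confirm the contents with a quick Presto query (an optional check; it simply reads back the rows created above):
select id, gender, age, country, num_purchases
from customers
order by id
;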
Building feature representation
Our goal in this portion of the tutorial is to build a regression model that predicts the number of purchases that new customers are likely to make.
The query in this portion of the tutorial extracts the feature vector and the target variable, num_purchases.
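quantitative_features() encodes numeric columns as name:value strings (for example, age becomes something like "age:23.0"), while categorical_features() encodes categorical columns as name#value strings (for example, "country#Japan"). As a quick, optional illustration, the following Hive query shows the encoding for a single customer:
-- Illustrative: encode one customer's columns into Hivemall feature strings
select
  quantitative_features(array("age"), age) as quantitative_part,                       -- e.g. ["age:23.0"]
  categorical_features(array("country", "gender"), country, gender) as categorical_part -- e.g. ["country#Japan", "gender#male"]
from
  customers
where
  id = 1
;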
- From Data Workbench > Databases.
- Navigate to my_ml_db.
- Select New Table.
- Name it training_lr.
- Save the table.
- From Data Workbench > Query.
- Create a new query.
- Select my_ml_db as the database to run and store the query.
- Select Hive as the processing engine.
- Copy and paste the following code into the editor of the query:
insert overwrite table training_lr
select
  id,
  array_concat(
    quantitative_features(
      array("age"),
      age
    ),
    categorical_features(
      array("country", "gender"),
      country, gender
    )
  ) as features,
  num_purchases
from
  customers
;
- Save your query and name it feature_rep.
- Run your query. The job messages might indicate that something went wrong; however, the data might still be written to the table correctly, so check the table before re-running the job.
- Validate that training_lr is populated with data similar to the following:
Training the Regression Model
train_regressor() requires you to specify an appropriate loss function. One option is the squared loss.
In this portion of the tutorial, you insert the trained regressor weights into a table.
-loss_function squared means that the query builds a simple linear regressor with squared loss, while it optimizes the parameters based on the AdaGrad optimization scheme with l2 regularization.
- From Data Workbench > Databases.
- Navigate to my_ml_db.
- Select New Table.
- Name it regressor.
- Save the table.
- From Data Workbench > Query.
- Create a new query.
- Select my_ml_db as the database to run and store the query.
- Select Hive as the processing engine.
- Copy and paste the following code into the editor of the query:
insert overwrite table regressor
select
  train_regressor(
    features,      -- feature vector
    num_purchases, -- target value
    '-loss_function squared -optimizer AdaGrad -regularization l2' -- hyper-parameters
  ) as (feature, weight)
from
  training_lr
;
- Save and run the query.
- Validate that the regressor table populates with data similar to:
Note that the values of the weight column may be different since the output depends on the input order of the training_lr table.
- Optionally, run the function with the -help option to list the available options:
select train_regressor(array(), 0, '-help');
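For example, you could retrain with a different optimizer. The variant below is illustrative only and not part of the tutorial; it keeps the squared loss but switches the optimizer to plain SGD:
-- Illustrative variant: same squared loss, SGD optimizer instead of AdaGrad
insert overwrite table regressor
select
  train_regressor(
    features,
    num_purchases,
    '-loss_function squared -optimizer SGD -regularization l2'
  ) as (feature, weight)
from
  training_lr
;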
Creating new customers
This portion of the tutorial includes adding a new table with data.
- From Data Workbench > Databases.
- Navigate to my_ml_db.
- Select New Table.
- Name it new_customers.
- Save the table.
- From Data Workbench > Query.
- Create a new query.
- Select my_ml_db as the database to run and store the query.
- Select Hive as the processing engine.
- Copy and paste the following code into the editor of the query:
insert overwrite table new_customers
select 1 as id, array("gender#male", "age:10", "country#Japan") as features
union all
select 2 as id, array("gender#female", "age:60", "country#US") as features
union all
select 3 as id, array("gender#other", "age:50", "country#UK") as features
;
Validate that data similar to the following is added to the new_customers table:
Predicting new customer behavior
This portion of the tutorial describes how to build a prediction without passing the result through the sigmoid() function. Unlike logistic regression (classification), regression predicts the target value directly as the weighted sum of the features.
To create a prediction without using the sigmoid() function:
- Create a new query.
- Select my_ml_db as the database to run and store the query.
- Select Hive as the processing engine.
- Copy and paste the following code into the editor of the query:
with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from new_customers t1 LATERAL VIEW explode(features) t2 as fv
)
select
  t1.id,
  sum(p1.weight * t1.value) as predicted_num_purchases
from
  features_exploded t1
  LEFT OUTER JOIN regressor p1 ON (t1.feature = p1.feature)
group by
  t1.id
;
The output is similar to:
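The predicted value is simply the weighted sum of the customer's features. If you want to see how much each feature contributes to a prediction, a query such as the following (an optional variant of the query above) breaks the sum apart:
-- Illustrative: per-feature contribution to each customer's prediction
with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from new_customers t1 LATERAL VIEW explode(features) t2 as fv
)
select
  t1.id,
  t1.feature,
  p1.weight * t1.value as contribution
from
  features_exploded t1
  LEFT OUTER JOIN regressor p1 ON (t1.feature = p1.feature)
order by
  t1.id, contribution desc
;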
Evaluating the predictions
This portion of the tutorial uses Root Mean Square Error (rmse()) or Mean Absolute Error (mae()) to evaluate the regressor.
- Create another query.
- Select Hive as the processing engine.
- Save it as prediction_evaluation<#>.
- Add the following code to the editor:
with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from training_lr t1 LATERAL VIEW explode(features) t2 as fv
),
predictions as (
  select
    t1.id,
    sum(p1.weight * t1.value) as predicted_num_purchases
  from
    features_exploded t1
    LEFT OUTER JOIN regressor p1 ON (t1.feature = p1.feature)
  group by
    t1.id
)
select
  -- compare each prediction against the actual num_purchases value
  rmse(t1.predicted_num_purchases, t2.num_purchases) as rmse,
  mae(t1.predicted_num_purchases, t2.num_purchases) as mae
from
  predictions t1
  join training_lr t2 on (t1.id = t2.id)
;
Validate that the output is similar to:
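rmse() computes sqrt(avg((predicted - actual)^2)) and mae() computes avg(abs(predicted - actual)). As an optional cross-check, you can compute the same metrics with standard Hive aggregates; the result should match the built-in functions up to floating-point rounding:
-- Optional cross-check of rmse()/mae() using standard Hive aggregate functions
with features_exploded as (
  select
    id,
    extract_feature(fv) as feature,
    extract_weight(fv) as value
  from training_lr t1 LATERAL VIEW explode(features) t2 as fv
),
predictions as (
  select
    t1.id,
    sum(p1.weight * t1.value) as predicted_num_purchases
  from
    features_exploded t1
    LEFT OUTER JOIN regressor p1 ON (t1.feature = p1.feature)
  group by
    t1.id
)
select
  sqrt(avg(pow(t1.predicted_num_purchases - t2.num_purchases, 2))) as rmse_manual,
  avg(abs(t1.predicted_num_purchases - t2.num_purchases)) as mae_manual
from
  predictions t1
  join training_lr t2 on (t1.id = t2.id)
;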