This tutorial uses the Iris dataset provided by the UCI Machine Learning Repository.

Note: The RandomForest interface changed in the v0.5.0 release on April 12, 2018.
Data preparation
Upload Iris data to Arm Treasure Data.
$ wget http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
$ sed '/^$/d' iris.data | awk 'BEGIN{OFS=","}{print NR,$0}' | sed '1i\
rowid,sepal_length,sepal_width,petal_length,petal_width,class' > iris.data.csv
$ head -3 iris.data.csv
rowid,sepal_length,sepal_width,petal_length,petal_width,class
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
$ td db:create iris
$ td table:create iris original
$ td import:auto --format csv --column-header --time-value `date +%s` --auto-create iris.original iris.data.csv
Then, create a mapping table to assign a label for each class.
$ td table:create iris label_mapping
$ td query -x --type hive -d iris "
  INSERT OVERWRITE TABLE label_mapping
  select
    class,
    rank - 1 as label
  from (
    select distinct class, dense_rank() over (order by class) as rank
    from original
  ) t;
"
Note: `train_randomforest_classifier` requires the target `label` to start from 0.
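The label-mapping query above assigns each distinct class a 0-based label ordered by class name, i.e., `dense_rank() - 1`. A minimal Python sketch of the same idea (illustrative only, not part of the Hive workflow):

```python
def build_label_mapping(classes):
    # Sort the distinct class names and enumerate from 0, mirroring
    # dense_rank() over (order by class) minus one.
    return {c: i for i, c in enumerate(sorted(set(classes)))}

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
mapping = build_label_mapping(classes)
print(mapping)  # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
```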
After that, prepare a training table for RandomForest training.
$ td table:create iris training
$ td query -x --type hive -d iris "
  INSERT OVERWRITE TABLE training
  select
    rowid() as rowid,
    array(t1.sepal_length, t1.sepal_width, t1.petal_length, t1.petal_width) as features,
    t2.label
  from
    original t1
    JOIN label_mapping t2 ON (t1.class = t2.class);
"
Train
Run training using a RandomForest classifier. The following example builds 50 decision trees for each mapper.
$ td table:create iris model
$ td query -x --type hive -d iris "
  INSERT OVERWRITE TABLE model
  select
    train_randomforest_classifier(features, label, '-trees 50')
  from
    training;
"
Note: There is no need to use `amplify` or `rand_amplify` with `train_randomforest_classifier`.
Training options
You can get information about the training hyperparameters by using the `-help` option as follows:
$ td query -w --type hive -d iris "
  select train_randomforest_classifier(features, label, '-help') from training;
"
usage: train_randomforest_classifier(array<double|string> features, int label
       [, const string options, const array<double> classWeights])
       - Returns a relation consisting of <string model_id, double model_weight,
       string model, array<double> var_importance, int oob_errors, int oob_tests>
       [-attrs <arg>] [-depth <arg>] [-help] [-leafs <arg>] [-min_samples_leaf <arg>]
       [-rule <arg>] [-seed <arg>] [-splits <arg>] [-stratified] [-subsample <arg>]
       [-trees <arg>] [-vars <arg>]
 -attrs,--attribute_types <arg>     Comma separated attribute types (Q for
                                    quantitative variable and C for categorical
                                    variable. e.g., [Q,C,Q,C])
 -depth,--max_depth <arg>           The maximum number of the tree depth
                                    [default: Integer.MAX_VALUE]
 -help                              Show function help
 -leafs,--max_leaf_nodes <arg>      The maximum number of leaf nodes
                                    [default: Integer.MAX_VALUE]
 -min_samples_leaf <arg>            The minimum number of samples in a leaf node
                                    [default: 1]
 -rule,--split_rule <arg>           Split algorithm [default: GINI, ENTROPY]
 -seed <arg>                        seed value in long [default: -1 (random)]
 -splits,--min_split <arg>          A node that has greater than or equals to
                                    `min_split` examples will split [default: 2]
 -stratified,--stratified_sampling  Enable stratified sampling for unbalanced data
 -subsample <arg>                   Sampling rate in range (0.0,1.0] [default: 1.0]
 -trees,--num_trees <arg>           The number of trees for each task [default: 50]
 -vars,--num_variables <arg>        The number of random selected features
                                    [default: ceil(sqrt(x[0].length))].
                                    int(num_variables * x[0].length) is considered
                                    if num_variables is in (0,1]
Parallelize Training
In Treasure Data, a MapReduce task is launched for each 512 MB data chunk. Each task may use only one virtual CPU core, so the RandomForest training time is linear in the number of decision trees per task. To parallelize RandomForest training across multiple tasks, you can use UNION ALL as follows:
$ td query -x --type hive -d iris "
  INSERT OVERWRITE TABLE model
  select train_randomforest_classifier(features, label, '-trees 25') from training
  UNION ALL
  select train_randomforest_classifier(features, label, '-trees 25') from training;
"
Alternatively, you can run multiple INSERT INTO queries for training as follows:
$ td query -x --type hive -d iris "
  INSERT INTO TABLE model
  select train_randomforest_classifier(features, label, '-trees 25') from training;
"
Variable importance and Out-of-Bag test
The output of training includes information to show variable importance and out-of-bag (OOB) test results.
$ td query -w --type hive -d iris "
  select
    array_sum(var_importance) as var_importance,
    sum(oob_errors) / sum(oob_tests) as oob_err_rate
  from
    model;
"
var_importance | oob_err_rate
---|---
[15.419672515790172,6.40339076572934,29.40103441471922,31.947085260871326] | 0.04666666666666667
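To make the aggregated `var_importance` vector easier to read, it can be normalized into relative percentages. A quick Python sketch (the feature names follow the order of the training table's `features` array):

```python
# Importance values taken from the query result above.
var_importance = [15.419672515790172, 6.40339076572934,
                  29.40103441471922, 31.947085260871326]
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Normalize to percentages so features can be compared directly.
total = sum(var_importance)
relative = {f: round(100.0 * v / total, 1) for f, v in zip(features, var_importance)}
print(relative)
```

The petal measurements dominate, which matches the well-known separability of the Iris classes by petal size.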
Predict
You can get prediction results from the trained model as follows:
$ td table:create iris predicted
$ td query -w -x --type hive -d iris "
  WITH t2 as (
    SELECT
      rowid,
      rf_ensemble(predicted.value, predicted.posteriori, model_weight) as predicted
      -- rf_ensemble(predicted.value, predicted.posteriori) as predicted -- to avoid OOB accuracy (i.e., model_weight)
    FROM (
      SELECT
        t.rowid,
        p.model_weight,
        tree_predict(p.model_id, p.model, t.features, '-classification') as predicted
      FROM
        model p
        LEFT OUTER JOIN training t
    ) t1
    group by rowid
  )
  INSERT OVERWRITE TABLE predicted
  SELECT rowid, predicted.label, predicted.probability, predicted.probabilities
  FROM t2;
"
Note: To use a model created by v0.4.2, use `tree_predict_v1` instead of `tree_predict`, as follows: `tree_predict_v1(p.model_id, p.model_type, p.pred_model, t.features, true)`
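Conceptually, `rf_ensemble` combines the per-tree predictions for each row into a single label, weighting each tree's vote (here by its OOB-based `model_weight`). A simplified sketch of that idea, not Hivemall's exact implementation:

```python
from collections import defaultdict

def ensemble_vote(predictions):
    # predictions: list of (predicted_label, model_weight) pairs, one per tree.
    # Sum the weights per label and return the label with the largest total.
    scores = defaultdict(float)
    for label, weight in predictions:
        scores[label] += weight
    return max(scores, key=scores.get)

# Three trees: two vote for class 0, one for class 1 (hypothetical weights).
votes = [(0, 0.96), (0, 0.94), (1, 0.90)]
print(ensemble_vote(votes))  # 0: the combined weight for class 0 wins
```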
Evaluate
You can evaluate the accuracy of the training as follows:
$ td query -w --type presto -d iris "
  select count(1) from training;
"
> 150
$ td query -w --type hive -d iris "
  WITH t1 as (
    SELECT
      t.rowid,
      t.label as actual,
      p.label as predicted
    FROM
      predicted p
      LEFT OUTER JOIN training t ON (t.rowid = p.rowid)
  )
  SELECT count(1) / 150.0 FROM t1 WHERE actual = predicted;
"
> 0.9933333333333333
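The accuracy query counts the rows whose predicted label matches the actual label and divides by the total number of training rows. The same computation in a few lines of Python (toy values for illustration):

```python
def accuracy(actual, predicted):
    # Fraction of rows where the predicted label equals the actual label.
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)

# Hypothetical example: 3 of 4 rows predicted correctly.
print(accuracy([0, 1, 2, 2], [0, 1, 2, 1]))  # 0.75
```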
Export models in human-readable format
You can export prediction models into JavaScript or Graphviz format.
$ td table:create iris model_exported
$ td query -w --type hive -d iris "
  INSERT OVERWRITE TABLE model_exported
  select
    model_id,
    tree_export(model, '-type javascript', array('sepal_length','sepal_width','petal_length','petal_width'), array('Setosa','Versicolour','Virginica')) as js,
    tree_export(model, '-type graphvis', array('sepal_length','sepal_width','petal_length','petal_width'), array('Setosa','Versicolour','Virginica')) as dot
  from
    model;
"
usage: tree_export(string model, const string options, optional array<string>
       featureNames=null, optional array<string> classNames=null)
       - exports a Decision Tree model as javascript/dot
       [-help] [-output_name <arg>] [-r] [-t <arg>]
 -help                            Show function help
 -output_name,--outputName <arg>  output name [default: predicted]
 -r,--regression                  Is regression tree or not
 -t,--type <arg>                  Type of output [default: js; javascript/js, graphvis/dot]
Graphviz dot data can be visualized on viz-js.com.
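An exported decision tree boils down to nested threshold comparisons over the feature vector. The sketch below mimics that structure in Python; the thresholds are illustrative values for the Iris data, not taken from an actual exported model:

```python
def predict(features):
    # features: [sepal_length, sepal_width, petal_length, petal_width]
    # Hypothetical split thresholds, chosen only to illustrate the shape
    # of an exported tree (real thresholds come from the trained model).
    sepal_length, sepal_width, petal_length, petal_width = features
    if petal_width <= 0.8:
        return "Setosa"
    elif petal_width <= 1.75:
        return "Versicolour"
    else:
        return "Virginica"

print(predict([5.1, 3.5, 1.4, 0.2]))  # Setosa
```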