Data preparation
Download ml-20m.zip and unzip it. Then, create a database and import the raw ratings data into Arm Treasure Data from the downloaded CSV. The --time-value
is used to add a dummy time column (This is because Treasure Data requires each row have a timestsamp).
$ td db:create movielens20m $ td table:create movielens20m ratings $ td import:auto --format csv --column-header --time-value `date +%s` --auto-create movielens20m.ratings ./ratings.csv
The first step is to split the original data for training and testing.
$ td table:create movielens20m ratings_mf $ td query -w -x -d movielens20m " INSERT OVERWRITE TABLE ratings_mf SELECT rand(31) as rnd, userid, movieid, rating FROM ratings "
The data is split 80% for training and 20% for testing.
$ td table:create movielens20m training_mf $ td query -x --type hive -d movielens20m " INSERT OVERWRITE TABLE training_mf SELECT userid, movieid, rating, rnd FROM ratings_mf ORDER BY rnd DESC LIMIT 16000000 " $ td table:create movielens20m testing_mf $ td query -x --type hive -d movielens20m " INSERT OVERWRITE TABLE testing_mf SELECT userid, movieid, rating, rnd FROM ratings_mf ORDER BY rnd ASC LIMIT 4000263 "
Training
Calculate the mean rating in the training_mf dataset.
td query -w --type presto -d movielens20m " SELECT AVG(rating) FROM training_mf " > 3.52576146875
The above average value used in the following queries.
Run training of Matrix Factorization. Result output to TD (—result) feature is not recommended for issuing the following query. Use INSERT INTO table
statement instead.
$ td table:create movielens20m mf_sgd_model_f20 $ td query -x --type hive -d movielens20m " INSERT OVERWRITE TABLE mf_sgd_model_f20 SELECT idx, array_avg(u_rank) as Pu, array_avg(m_rank) as Qi, avg(u_bias) as Bu, avg(m_bias) as Bi FROM ( SELECT train_mf_sgd(userid, movieid, rating, '-factor 20 -mu 3.52576146875 -iter 50') AS (idx, u_rank, m_rank, u_bias, m_bias) FROM training_mf ) t GROUP BY idx "
The signature of train_mf_sgd
is train_mf_sgd(int userid, int itemid, numeric rating [, string options])
.
Training options
You can get information about hyperparameter for training using -help
option as follows:
$ td query -w --type hive -d movielens20m " SELECT train_mf_sgd(userid, movieid, rating, '-help') AS (idx, u_rank, m_rank, u_bias, m_bias) FROM training_mf " usage: MatrixFactorizationSGDUDTF [-cv_rate <arg>] [-disable_bias] [-disable_cv] [-eta <arg>] [-eta0 <arg>] [-help] [-iter <arg>] [-k <arg>] [-maxval <arg>] [-min_init_stddev <arg>] [-mu <arg>] [-power_t <arg>] [-r <arg>] [-rankinit <arg>] [-t <arg>] [-update_mean] -cv_rate,--convergence_rate <arg> Threshold to determine convergence [default: 0.005] -disable_bias,--no_bias Turn off bias clause -disable_cv,--disable_cvtest Whether to disable convergence check [default: enabled] -eta <arg> The initial learning rate [default: 0.001] -eta0 <arg> The initial learning rate [default 0.2] -help Show function help -iter,--iterations <arg> The number of iterations [default: 1] -k,--factor <arg> The number of latent factor [default: 10] -maxval,--max_init_value <arg> The maximum initial value in the rank matrix [default: 1.0] -min_init_stddev <arg> The minimum standard deviation of initial rank matrix [default: 0.1] -mu,--mean_rating <arg> The mean rating [default: 0.0] -power_t <arg> The exponent for inverse scaling learning rate [default 0.1] -r,--lambda <arg> The regularization factor [default: 0.03] -rankinit <arg> Initialization strategy of rank matrix [random, gaussian] (default: random) -t,--total_steps <arg> The total number of training examples -update_mean,--update_mu Whether update (and return) the mean rating or not
Predict
$ td table:create movielens20m sgd_predict_f20 $ td query -x --type hive -d movielens20m " INSERT OVERWRITE TABLE sgd_predict_f20 SELECT t2.actual, mf_predict(t2.Pu, p2.Qi, t2.Bu, p2.Bi, 3.52560165625) as predicted FROM ( SELECT t1.userid, t1.movieid, t1.rating as actual, p1.Pu, p1.Bu FROM testing_mf t1 LEFT OUTER JOIN mf_sgd_model_f20 p1 ON (t1.userid = p1.idx) ) t2 LEFT OUTER JOIN mf_sgd_model_f20 p2 ON (t2.movieid = p2.idx) "
Evaluate (computes MAE and RMSE)
$ td query -w --type hive -d movielens20m " SELECT MAE(predicted, actual) AS mae, RMSE(predicted, actual) AS rmse FROM sgd_predict_f20 "
mae | rmse |
---|---|
0.6123907679836259 | 0.8027164481776642 |
Comments
0 comments
Please sign in to leave a comment.