Treasure Data provides the following approaches to machine learning.
|Approach|Description|
|---|---|
|TD Console|Web-based user interface that requires little to no programming experience. Particularly designed for Marketers.|
|Query-based|A flexible and scalable query-based approach for anyone who knows SQL. Particularly designed for Data Engineers. This method uses TD Console, Hivemall, and Digdag.|
|pytd.pandas_td|pytd provides user-friendly interfaces to Treasure Data’s REST APIs, Presto query engine, and Plazma primary storage. Particularly designed for Data Scientists and Machine Learning Engineers. pytd allows Python code to efficiently read and write large volumes of data from and to Treasure Data.|
The TD Console is optimized for your company's customer data management, and is designed for marketers who might not be familiar with how to code machine learning algorithms.
The application has the following machine learning features:
- Content Affinity Engine - enables you to enrich customer data from customer behavior on websites.
- Predictive Customer Scoring - detects high potential customers for marketing campaign focus.
Content Affinity Engine
The content affinity engine associates interest words to customers using web site information, including page title and description. For more details, see the article on data enrichment.
Predictive Customer Scoring
Predictive scoring enables you to identify customer segments that are likely to buy, churn, click, or convert in the near future. By defining conversions on customer behavior data, you can build a predictive model that finds customers and prospects who are likely to convert. See this page for more details.
Treasure Data machine learning provides flexible and scalable capability based on Apache Hivemall and Treasure Data Workflows. By using the TD Console to define and run your SQL queries, you can build your prediction model on your own. You can rapidly evolve machine learning tasks because there is no need to move data to and from Treasure Data.
The following example, adapted from the Hivemall tutorial, uses Hivemall to train a classifier with logistic regression; the resulting model is then used to predict labels:

```sql
SELECT
  train_classifier(
    features,
    label,
    '-loss_function logloss -optimizer SGD'
  ) as (feature, weight)
FROM
  training
```
After you convert your table into pairs of feature vectors and labels, you can build a binary classifier. Because the resulting classifier table holds linear coefficients for the given features, you can predict unforeseen samples by computing a weighted sum of their features. Prediction for a feature vector is made with a join operation between the features of each sample and the features in your model.
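Conceptually, the weighted-sum prediction step looks like the following plain-Python sketch. The feature names and weights here are made up for illustration (Hivemall does this with a SQL join against the model table, and quantitative features also multiply in their values; this sketch keeps to simple categorical features):

```python
import math

# Hypothetical trained model: a mapping from feature name to learned weight.
model = {"height:170": 0.8, "weight:60": -0.3, "age:30": 0.5}

def predict_prob(features, model):
    """Join each feature of a sample against the model, sum the matched
    weights, and squash the total with a sigmoid to get P(label = 1)."""
    total = sum(model.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-total))

# A feature absent from the model (e.g. "city:tokyo") contributes weight 0.
sample = ["height:170", "age:30", "city:tokyo"]
prob = predict_prob(sample, model)
```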
Hivemall supports the following machine learning tasks:
- Binary and multi-class classification
- Anomaly detection
- Natural language processing
- Clustering (e.g., topic modeling)
- Data sketching
Managing the machine learning pipeline using Treasure Data Workflows
After ingesting your data into Treasure Data, you can build a predictive model using Treasure Data queries, workflows, and Hivemall.
The typical machine learning pipeline for supervised learning is:
- data preparation
- building a model
- evaluating the model
- predicting unseen data with trained model
For more information, refer to the wiki on supervised learning.
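As a minimal, self-contained illustration of these four steps, the sketch below trains a tiny logistic-regression model with SGD on made-up data. Everything here (data, learning rate, epochs) is illustrative; in practice these steps run as Treasure Data queries and workflows:

```python
import math
import random

# 1. Data preparation: synthetic samples (x1, x2) with a binary label.
random.seed(0)
def make_data(n):
    data = []
    for _ in range(n):
        x1, x2 = random.random(), random.random()
        label = 1 if x1 + x2 > 1.0 else 0
        data.append(((x1, x2), label))
    return data

train, test = make_data(200), make_data(50)

# 2. Building a model: logistic regression trained with plain SGD.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(100):                      # epochs
    for (x1, x2), y in train:
        p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        g = p - y                          # gradient of the log loss
        w[0] -= lr * g * x1
        w[1] -= lr * g * x2
        b -= lr * g

# 3. Evaluating the model: accuracy on held-out data.
def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

accuracy = sum(predict(x1, x2) == y for (x1, x2), y in test) / len(test)

# 4. Predicting unseen data with the trained model.
unseen = predict(0.9, 0.8)
```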
You can use Treasure Data Workflows to manage your supervised learning process. By using Digdag Treasure Data operators within your TD workflow, you can automate your machine learning from data preparation to prediction. Digdag Treasure Data operators include:
- td>: Treasure Data queries
- td_run>: Treasure Data saved queries
- td_ddl>: Treasure Data operations
- td_load>: Treasure Data bulk loading
- td_for_each>: Repeat using Treasure Data queries
- td_wait>: Waits for a Treasure Data query to return a result
- td_wait_table>: Waits for data to arrive in a Treasure Data table
- td_partial_delete>: Deletes a time range of data from a Treasure Data table
- td_table_export>: Exports a Treasure Data table to S3
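As a sketch, a workflow file chaining a few of these operators might look like the following. The database, table, and query file names are all illustrative:

```yaml
# example.dig -- all names below are hypothetical
timezone: UTC

schedule:
  daily>: 01:00:00

_export:
  td:
    database: ml_demo

+prepare:
  td>: queries/prepare_training_data.sql
  create_table: training

+train:
  td>: queries/train_classifier.sql
  create_table: classifier

+predict:
  td>: queries/predict.sql
  create_table: predictions
```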
Digdag can run tasks in parallel, so you can simultaneously run independent tasks such as parameter tuning. Treasure Data Workflows enable you to make prediction tasks a periodic part of your product offerings. Running your machine learning processes in stable hourly or daily batches gives you a reliable way to iterate on them and derive a better predictive model.
Example machine learning workflows with Hivemall are available at GitHub workflow-examples/machine-learning/.
Treasure Data provides an easy way to access data in Treasure Data from your local Python environment with pytd. Using this method requires familiarity with:
- Python, including its data structures and algorithms
- pandas, an open source library providing data structures and data analysis tools for Python
- Jupyter, an open source project that supports interactive data science and scientific computing across many programming languages
pytd brings these tools together to help you build machine learning models to analyze data stored in Treasure Data.
pytd is a Python library for interactive data analysis that uses Treasure Data as a data lake of time-series events, pandas for local DataFrame operations, and Jupyter to manage your analysis sessions. pytd also supports interactive analysis with Jupyter notebook.
Using pytd.pandas_td, you can fetch aggregated data from Treasure Data and move it into pandas. After creating a pandas DataFrame, you can visualize your data, and build a model with your favorite Python machine learning libraries such as scikit-learn, XGBoost, LightGBM, and TensorFlow.
Learn how to connect Treasure Data with pytd.pandas_td. You can use pytd.pandas_td for exploratory data analysis (EDA) before building machine learning models with Hivemall. For an example, see the Jupyter notebook portion of this GitHub Gist.
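A session might look like the following sketch. The API key, endpoint, database, and query are placeholders, and pytd.pandas_td follows the pandas-td interface, so check the pytd documentation for the exact calls available in your version:

```python
import os
import pytd.pandas_td as td

# Placeholder credentials -- set TD_API_KEY in your environment.
con = td.connect(apikey=os.environ["TD_API_KEY"],
                 endpoint="https://api.treasuredata.com")

# Query with Presto against a sample database.
engine = td.create_engine("presto:sample_datasets", con=con)

# Fetch aggregated data into a pandas DataFrame.
df = td.read_td(
    "SELECT symbol, COUNT(1) AS cnt FROM nasdaq GROUP BY 1",
    engine,
)

# From here, df is a regular pandas DataFrame: visualize it, or hand it to
# scikit-learn, XGBoost, LightGBM, or TensorFlow for modeling.
print(df.head())
```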