Arm Treasure Data can be considered an “event data lake” where disparate event data sources (and a few slow moving dimensions) are aggregated and processed to create more compact and “cleaner” data packages for further processing, analysis or visualization.
Given the size and scope of an event data lake, providing highly concurrent interactive access over trillions of data points while retaining schema flexibility is technologically impossible (at least affordably). As a way to work around this limitation, a design pattern called “lakeshore data marts”.
image credit: Martin Fowler, DataLake
We named our new product “Data Tanks” to stay within the data as water metaphor as we believe provides a great mental model for how data pipelines for analytics works.
Data Tanks provide a convenient and accessible metric store to drive business intelligence and data visualization tools from Treasure Data without the burden of managing one or more separate data marts.
Create a New Data Tank
For now, creating a new Data Tank is a completely manual process performed by our operations team. We are building the necessary API and UI support and we will release a testing version as soon as that is ready. In the meantime, if you need us to create an instance, let us know the following:
- IP range from which you will be connecting to the Data Tank
- Maximum storage size needed
- Approximate number of concurrent connections you foresee making to the Data Tank
After agreeing on the instance size, the operations team provisions a new Data Tank.
To confirm Data Tanks are successfully provisioned, open a new query editor screen, check the “Output Results” box and search for “datatank” in saved connections. “datatank” and “datatank_cstore” should show up.
What is cstore?
Data Tank is available in two flavors: Row-oriented (vanilla) and Columnar (cstore). If you are not planning on using the following features of PostgreSQL within your Data Tank:
- column deletion
- column update
and you’re only planning to run analytical workloads, then you should use the cstore option which will have the following benefits:
- shrink your data footprint by 3 to 4 times on average (up to 12 times for JSON data)
- speed up your queries by an average of 30%, and another 30% more by using indexing
Exporting data to a Data Tank
Exporting data to a Data Tank is just like using a normal TD result export.
The UI of the result export in the Console shows the following fields:
- Database: the name of the destination database. Do not change this.
- Table: the name of the table within the database selected above. If the destination table does not exist, it is created on the fly.
- Mode: default is “append”. Replace, Truncate and Update are other options. See our existing documentation for more information. In the case of Update, the Unique key must be set explicitly.
- Method: this option controls the way data is written to the instance. Copy should be used for better performance, while Insert might be necessary for historical reasons.
- Schema: selects a target schema for the table. If not specified, then the default schema is used.
- Foreign Data Wrapper: use the cstore extension or not.
|We provide two type of users.
tank_integration_user: This user is for accessing to DataTanks from Treasure Data components, for example: Output result to DataTanks, pg> operator from TD Workflow. Contact support if you require a tank_integration_user password.
tank_user: This user is for accessing user side client. You can access to DataTanks like a PostgreSQL using this user.
Querying Data Tanks from Treasure Data’s console.
You can query Data Tanks using Presto, including the ability to join data across PlazmaDB (primary storage) and Data Tanks.
Some use cases include:
- Store user master data (which requires row-by-row lookups) in a Data Tank and join it against raw event data in PlazmaDB
- Cache the result of an expensive analytic query in a Data Tank table and re-use them in Presto joins
Also, this means that Data Tanks can be queries as a task within Treasure Workflow.
Querying Treasure Data’s PlazmaDB (primary storage) storage from Data Tanks
It’s possible to issue queries to Treasure Data’s PlazmaDB (primary storage) from Data Tanks directly. This approach makes your Treasure Data dataset appear as foreign tables in your data tank’s database.
Specify your API key, database, query engine type (
CREATE FOREIGN TABLE statement. You can specify either your table name or query for Treasure Data directly.
CREATE FOREIGN TABLE sample_datasets ( time integer, "user" varchar, host varchar, path varchar, referer varchar, code integer, agent varchar, size integer, method varchar ) SERVER td_server OPTIONS ( apikey 'your_api_key', database 'sample_datasets', query_engine 'presto', table 'www_access' ); SELECT code, count(1) FROM sample_datasets WHERE time BETWEEN 1412121600 AND 1414800000 -- For time index pushdown in Treasure Data GROUP BY code; code | count ------+------- 404 | 17 200 | 4981 500 | 2 (3 rows) CREATE FOREIGN TABLE nginx_status_summary ( text varchar, cnt integer ) SERVER td_server OPTIONS ( apikey 'your_api_key', database 'api_staging', query_engine 'presto', query 'SELECT c.text, COUNT(1) AS cnt FROM nginx_access n JOIN mitsudb.codes c ON CAST(n.status AS bigint) = c.code WHERE TD_TIME_RANGE(n.time, ''2015-07-05'') GROUP BY c.text' ); SELECT * FROM nginx_status_summary; text | cnt ---------------+---------- OK | 10123456 Forbidden | 12 Unauthorized | 3211 : CREATE TABLE imported_summary AS SELECT * FROM nginx_status_summary; SELECT * FROM imported_summary; text | cnt ---------------+---------- OK | 10123456 Forbidden | 12 Unauthorized | 3211
Also, you can specify other API endpoints.
SERVER td_server OPTIONS ( endpoint 'https://api.ybi.idcfcloud.net' apikey 'your_api_key', ...