Arm Treasure Data can be considered an “event data lake” where disparate event data sources (and a few slow moving dimensions) are aggregated and processed to create more compact and “cleaner” data packages for further processing, analysis or visualization.
Given the size and scope of an event data lake, providing highly concurrent interactive access over trillions of data points while retaining schema flexibility is not technologically feasible (at least not affordably). A common way to work around this limitation is the design pattern known as "lakeshore data marts".
(Image credit: Martin Fowler, DataLake)
Our Data Tanks use the data-as-water metaphor to provide a mental model for how data pipelines for analytics work.
Data Tanks provide a convenient and accessible metric store to drive business intelligence and data visualization tools from Treasure Data without the burden of managing one or more separate data marts.
Data Tank is available as:
- Row-based (standard PostgreSQL)
- Columnar (cstore)
If you are not planning to use the following PostgreSQL features within your Data Tank:
- column deletion
- column update
and you plan to run only analytical workloads, then you should use the cstore option, which has the following benefits (see the sketch after this list):
- shrink your data footprint by 3 to 4 times on average (up to 12 times for JSON data)
- speed up your queries by an average of 30%, and another 30% more by using indexing
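For context, a cstore table in PostgreSQL is defined through the cstore_fdw foreign data wrapper. Below is a minimal sketch of what such a definition looks like; the table and column names are hypothetical, and in a Data Tank the extension, server, and table creation are normally handled for you by the export option described later.

-- Minimal cstore_fdw sketch (hypothetical table; the extension and server
-- are usually already set up in a cstore Data Tank).
CREATE EXTENSION IF NOT EXISTS cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE page_views (
    time    integer,
    user_id varchar,
    path    varchar
)
SERVER cstore_server
OPTIONS (compression 'pglz');  -- columnar storage with compression

Note that cstore_fdw foreign tables do not support UPDATE or DELETE, which is why the restrictions above apply.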
We provide two types of users.
tank_integration_user: This user is for accessing Data Tanks from Treasure Data components, for example, result output to Data Tanks or the pg> operator in TD Workflow. Contact support if you require the tank_integration_user password.
tank_user: This user is for access from your own client tools. You can connect to Data Tanks with this user as you would with any PostgreSQL database.
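Once connected as tank_user with any standard PostgreSQL client, ordinary SQL works as expected. For example, this catalog query (plain information_schema SQL, nothing Data Tank specific) lists the tables visible to you:

-- List user-visible tables in the Data Tank
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name;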
Creating a New Data Tank
- Creating a new Data Tank is a process performed by our operations team.
- Contact your TD representative and communicate the following:
- IP range from which you will be connecting to the Data Tank
- Maximum storage size needed
- Approximate number of concurrent connections you foresee making to the Data Tank
- After agreeing on the instance size, the operations team provisions a new Data Tank.
- You can confirm that your Data Tanks are successfully provisioned:
- open a new query editor screen,
- select Export Results
- search for “datatank” in saved connections. “datatank” and “datatank_cstore” should show up.
Exporting data to a Data Tank
- Locate a query for which Export Results is selected, or select Export Results.
- Review, add, or edit the values in the fields.
Database: the name of the destination database. Do not change this.
Table: the name of the table within the database. If the destination table does not exist, it is created.
Mode: default is “append”. Replace, Truncate and Update are other options. In the case of Update, the Unique key must be set explicitly.
Method: this option controls the way data is written to the instance. Copy should be used for better performance, while Insert might be necessary for historical reasons.
Schema: selects a target schema for the table. If not specified, then the default schema is used.
Foreign Data Wrapper: whether or not to use the cstore extension for the destination table.
Allow the query to run according to the defined schedule or run it immediately.
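Once the export has run, you can verify the result directly in the Data Tank using the tank_user credentials. A minimal check, with public.daily_metrics standing in as a hypothetical destination schema and table:

-- Hypothetical verification query; substitute your own schema and table.
SELECT count(*) AS exported_rows
FROM public.daily_metrics;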
Querying Data Tanks using the Treasure Data Console
You can query Data Tanks using Presto, including the ability to join data across PlazmaDB (primary storage) and Data Tanks.
Some use cases include:
- Store user master data (which requires row-by-row lookups) in a Data Tank and join it against raw event data in PlazmaDB
- Cache the result of an expensive analytic query in a Data Tank table and re-use it in Presto joins
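As a rough sketch of the first use case above: assuming the Data Tank is exposed to Presto under a catalog named datatank (the actual catalog name depends on your account setup), and using hypothetical table and column names, a cross-store join could look like this:

-- Join raw events in PlazmaDB with a user master table kept in a Data Tank.
-- Catalog, schema, table, and column names here are illustrative only.
SELECT
  u.segment,
  COUNT(1) AS events
FROM events e                        -- event data in PlazmaDB
JOIN datatank.public.user_master u   -- lookup table in the Data Tank
  ON e.user_id = u.user_id
WHERE TD_TIME_RANGE(e.time, '2019-01-01', '2019-02-01')
GROUP BY u.segment;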
This also means that Data Tanks can be queried as a task within Treasure Workflow.
Querying Treasure Data’s PlazmaDB (primary storage) from Data Tanks
It’s possible to issue queries to Treasure Data’s PlazmaDB (primary storage) from Data Tanks directly. This approach makes your Treasure Data dataset appear as foreign tables in your data tank’s database.
Specify your API key, database, query engine type, and either a table name or a query in the OPTIONS clause of the CREATE FOREIGN TABLE statement. You can reference a Treasure Data table directly or embed a query.
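In a managed Data Tank the foreign data wrapper and server are usually already provisioned. If you ever need to create them yourself, the one-time setup looks roughly like this, assuming the wrapper is the open-source treasuredata_fdw extension (which matches the OPTIONS used below):

-- One-time setup sketch; skip if td_server already exists in your Data Tank.
CREATE EXTENSION IF NOT EXISTS treasuredata_fdw;
CREATE SERVER td_server FOREIGN DATA WRAPPER treasuredata_fdw;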
CREATE FOREIGN TABLE sample_datasets (
    time    integer,
    "user"  varchar,
    host    varchar,
    path    varchar,
    referer varchar,
    code    integer,
    agent   varchar,
    size    integer,
    method  varchar
)
SERVER td_server OPTIONS (
    apikey 'your_api_key',
    database 'sample_datasets',
    query_engine 'presto',
    table 'www_access'
);

SELECT code, count(1) FROM sample_datasets
WHERE time BETWEEN 1412121600 AND 1414800000 -- For time index pushdown in Treasure Data
GROUP BY code;

 code | count
------+-------
  404 |    17
  200 |  4981
  500 |     2
(3 rows)

CREATE FOREIGN TABLE nginx_status_summary (
    text varchar,
    cnt  integer
)
SERVER td_server OPTIONS (
    apikey 'your_api_key',
    database 'api_staging',
    query_engine 'presto',
    query 'SELECT c.text, COUNT(1) AS cnt
           FROM nginx_access n
           JOIN mitsudb.codes c ON CAST(n.status AS bigint) = c.code
           WHERE TD_TIME_RANGE(n.time, ''2015-07-05'')
           GROUP BY c.text'
);

SELECT * FROM nginx_status_summary;

     text     |   cnt
--------------+----------
 OK           | 10123456
 Forbidden    |       12
 Unauthorized |     3211
 :

CREATE TABLE imported_summary AS SELECT * FROM nginx_status_summary;

SELECT * FROM imported_summary;

     text     |   cnt
--------------+----------
 OK           | 10123456
 Forbidden    |       12
 Unauthorized |     3211
Also, you can specify other API endpoints.
SERVER td_server OPTIONS (
    endpoint 'https://api.ybi.idcfcloud.net',
    apikey 'your_api_key',
    ...