Use Presto, via TD Console, to query data stored in your Arm Treasure Data Data Tank.
This optional feature enables users of Data Tanks to run queries against their Data Tank instance directly from the Presto query engine.
What is a Data Tank?
Data Tanks provide easy access to your aggregated metrics through convenient, fully hosted data marts on Treasure Data’s core platform. They can be used to drive a variety of external business intelligence and visualization applications without having to host and maintain your own PostgreSQL instances.
Data Tanks are PostgreSQL databases that use a columnar store to accelerate analytical queries. They are completely managed by Treasure Data including creation, setup, monitoring, management and troubleshooting so you can just get your job done.
Use of Data Tanks provides the following key benefits:
- Easier access to Data Tank data by querying from TD Console via Presto, including the creation and deletion of tables.
- More flexible processing flows, because data can be joined across Data Tank and core TD tables.

Limitations
- Table lists and table previews are not supported in the query editor. You can still explore and preview tables by issuing queries against Data Tank system tables and normal tables.
- You cannot access tables with mixed-case names. Presto is a lowercase character system, and accessing case-sensitive systems such as PostgreSQL through Presto is a known issue.
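A possible workaround for mixed-case names is to wrap the table in a lowercase view on the PostgreSQL side and query the view from Presto. A minimal sketch, assuming you can connect directly to the Data Tank PostgreSQL instance and that "MixedCase" is a hypothetical mixed-case table:

```sql
-- Run directly against the Data Tank PostgreSQL instance (e.g. via psql).
-- "MixedCase" is a hypothetical mixed-case table; the view name is all lowercase.
create view mixedcase as select * from "MixedCase";
```

From Presto, query datatank.public.mixedcase instead of the original table.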
Presto's PostgreSQL connector does not support the PostgreSQL numeric (decimal) type in Data Tanks.
To use decimal data in Data Tanks with Presto, define the column with the "real" or "double precision" type instead.
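For example, a minimal sketch (hypothetical table and column names) of defining a Data Tank table so that Presto can read its decimal data; the statement is run directly against the Data Tank PostgreSQL instance:

```sql
-- A column declared as numeric(10,2) would not be readable from Presto;
-- declare it as double precision (or real) instead.
create table prices (
  symbol varchar(10),
  price  double precision
);
```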
Queries from Presto against Data Tank tables that use aggregates (COUNT, MAX, SUM, etc.) may not generate optimal SQL for PostgreSQL. Presto does not push the aggregation down to the Data Tank PostgreSQL environment. Instead, all the rows needed to compute the aggregate are returned to Presto, and the aggregation is handled on a single node. If the source table is large enough, the required data movement and computation take too long, and TD may kill the query with a message like:
[INTERNAL_FAILURE] Killed by the system because this query stalled for more than 1.00h. This is usually caused by a bug of Presto. The query execution will be retried in several minutes. Please ask email@example.com if you need a further help for this query
To work around this limitation, create a view in PostgreSQL that processes the aggregate on the PostgreSQL side, and query that view from the Presto side.
For example, a query against a 100M row Data Tank table giant_table is likely to fail:
select COUNT(*) as giant_count from datatank.public.giant_table;
A more efficient approach is to create a view in PostgreSQL that performs the resource-intensive processing in Data Tank:
create view count_view as select count(*) as cnt from giant_table;
Then execute a query on the view from Presto:
select cnt as giant_count from datatank.public.count_view;
How to Use
Data Querying (DML Queries)
SELECT query against a table
select avg(close) from datatank.public.normal_nasdaq where symbol = 'AAPL';

       _col0
--------------------
 14.823083442853807
(1 row)
SELECT query that joins Data Tank and TD Core tables
-- sample_datasets.nasdaq is a TD table
-- the sample_datasets. prefix can be left out if you specify it with the '-d' option on the command line or select it in the database dropdown list
select count(distinct time)
from datatank.public.normal_nasdaq dt_nasdaq
JOIN sample_datasets.nasdaq td_nasdaq
  ON dt_nasdaq.symbol = td_nasdaq.symbol
  AND dt_nasdaq.open = td_nasdaq.open
  AND dt_nasdaq.close = td_nasdaq.close
where dt_nasdaq.symbol = 'AAPL';
INSERT INTO a table in the Data Tank
insert into datatank.public.ms_nasdaq (symbol, open, volume, high, low, close, time) values ('AAPL', 10, 100, 20, 5, 7, td_scheduled_time());
INSERT INTO a Data Tank table from a TD table
insert into datatank.public.ms_nasdaq select * from sample_datasets.nasdaq where symbol = 'AAPL';
- UPDATE against a table in Data Tank (not supported in Presto). Use Result Output to Data Tank for this purpose.
- DELETE against a table in Data Tank (not supported by the Presto PostgreSQL connector)
Database Management Queries (DDL Queries)
List schemas

show schemas from datatank;

       Schema
--------------------
 information_schema
 pg_catalog
 public
List tables in schema
show tables from datatank.public;

     Table
---------------
 normal_nasdaq
List columns of a table
desc datatank.public.normal_nasdaq;

 Column |  Type   | Comment
--------+---------+---------
 symbol | varchar |
 open   | double  |
 volume | bigint  |
 high   | double  |
 low    | double  |
 close  | double  |
CREATE TABLE AS in Data Tank from TD Table data
create table datatank.public.ms_nasdaq as select * from sample_datasets.nasdaq where symbol = 'MSFT';
DROP TABLE in Data Tank

drop table datatank.public.ms_nasdaq;
The following statements are not supported in Data Tank:
- Plain CREATE TABLE (CREATE TABLE AS SELECT is supported)
- ALTER TABLE
- DROP FOREIGN TABLE
When this querying functionality is turned on, Account Owner and Admin users are given “Full Access” permissions.
All other users can be granted “Full Access” or “Query Only”.
- “Full Access” includes both DML and DDL query usage
- “Query Only” includes only DML query usage
Administrators can grant these access levels in the Team section of TD Console by selecting the database named ‘datatank’. The ‘datatank’ database is only available if your account has had this feature turned on by request.
Key Details and Questions Answered
Getting Access to this Functionality
This feature can be accessed by customers of our Data Tank add-on, at no additional cost.
Contact your account representative or Support.
Presto Query Processing Flow
A typical processing flow for a Presto query that joins data from a TD table with data from a Data Tank table is:
- The user writes a query in Presto (within the TD Console)
- The Presto query begins running
- The Presto query generates and submits a sub-query to the Data Tank PostgreSQL instance
- The Data Tank instance runs that query, and returns the results to the Presto cluster
- The Presto cluster finishes running the query, completing the join of the data from the core TD dataset and the data returned from the PostgreSQL query result
- The query result is displayed in the TD Console. It can be submitted to a result output job, or downloaded from the browser, just like the result of any other Presto query
Can I issue queries that join data across TD core and Data Tank?
You can create queries that join TD core stored data and Data Tank stored data.
What happens when someone runs TD_TIME_RANGE(), or other TD specific UDFs, against data in Data Tank?
Presto and Treasure Data specific functions run only after data is returned from the Data Tank instance. As such, TD_TIME_RANGE() and other such functions are not “pushed down” into the Data Tank query issued from Presto.
In general, we recommend not using such functions because it’s more performant to run queries using PostgreSQL friendly syntax. By using PostgreSQL friendly syntax, much of the processing can be completed on the Data Tank instance.
If you do run a TD- or Presto-specific function, the query still runs, but it may be less performant than if the WHERE clause were effectively “pushed down” into the PostgreSQL query issued from Presto.
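For example, a minimal sketch (assuming a hypothetical Data Tank table events with a bigint time column holding Unix epoch seconds) of the same filter written both ways:

```sql
-- Less efficient: TD_TIME_RANGE is not pushed down, so every row is
-- returned from the Data Tank to Presto before the filter is applied.
select count(*) from datatank.public.events
where TD_TIME_RANGE(time, '2023-01-01', '2023-02-01', 'UTC');

-- More efficient: a plain comparison can be evaluated inside the
-- Data Tank PostgreSQL instance, so far fewer rows cross the wire.
select count(*) from datatank.public.events
where time >= 1672531200 and time < 1675209600;
```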
How are Data Tank types converted if the data type doesn’t exist in Treasure Data core (for example, timestamp, float, or varchar)?
Presto as a query engine supports more data types than TD’s core tables. As such, when querying against PostgreSQL, our Presto query engine is able to handle types such as timestamp. Full type support by Presto can be seen here: https://prestosql.io/docs/current/language/types.html
If you are joining data, you may need to run a type conversion function to ensure the TD data type aligns with the data type you are joining from the Data Tank instance. But you may need the same kind of type conversion when joining any other two tables as well.
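For example, a minimal sketch (hypothetical tables: a TD core table my_db.td_events with a bigint Unix-epoch time column, and a Data Tank table dt_events with a timestamp column created_at) of converting the TD value so the join keys have matching types:

```sql
select count(*)
from datatank.public.dt_events dt
join my_db.td_events td
  -- from_unixtime converts the TD epoch (bigint) to a Presto timestamp
  on dt.created_at = from_unixtime(td.time);
```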