- Basic knowledge of Arm Treasure Data.
- Basic knowledge of Presto query engine.
Only specifying the columns you need
Data in Treasure Data is stored in a columnar storage format, which is optimized for queries that read only specific columns. Restricting the columns you access can improve query performance significantly, so specify only the columns you need instead of using a wildcard (*).
[GOOD]: SELECT time, user, host FROM tbl
[BAD]:  SELECT * FROM tbl
Leveraging Time-based partitioning
All imported data is automatically partitioned into hourly buckets, based on the ‘time’ field within each data record. By specifying the time range in your query, you can avoid reading unnecessary data and can thus speed up your query significantly.
WHERE time <=> Integer

When the ‘time’ field is compared against an integer within the WHERE clause, as in the form above, the query parser automatically detects which partitions should be processed. This auto-detection does not work if you specify the time with a float instead of an integer.
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 + 3600
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 - 3600
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 13493930200 / 10
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020.00
[BAD]:  SELECT field1, field2, field3 FROM tbl WHERE time BETWEEN 1349392000 AND 1349394000
An easier way to slice data is to use the TD_TIME_RANGE UDF.
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01 PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', NULL, 'PDT')
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', TD_TIME_ADD('2013-01-01', '1day', 'PDT'))
However, if you use division in TD_TIME_RANGE, time-partition optimization doesn’t work. For instance, the following conditions disable the optimization.
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_SCHEDULED_TIME() / 86400 * 86400)
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, 1356998401 / 86400 * 86400)
The TD_INTERVAL UDF is also available to slice data. TD_INTERVAL provides an intuitive way to specify a time range.
# The last 7 days
[GOOD]: SELECT ... WHERE TD_INTERVAL(time, '-7d')
Considering the cardinality within GROUP BY
You can improve the performance of GROUP BY by ordering the list of fields within GROUP BY from highest cardinality to lowest.
[GOOD]: SELECT ... GROUP BY uid, gender
[BAD]:  SELECT ... GROUP BY gender, uid
Another tip for using GROUP BY: use numbers instead of strings for the GROUP BY columns, because numbers require less memory and are faster to compare than strings.
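As a sketch of this tip (the logs table and its status column are hypothetical), grouping on an integer cast of a numeric string can be cheaper than grouping on the string itself:

```sql
-- Hypothetical table: logs(status VARCHAR, ...)
-- Hashing and comparing fixed-size integers during aggregation
-- is cheaper than doing so with variable-length strings.
SELECT CAST(status AS INTEGER) AS status_code,
       COUNT(*) AS requests
FROM logs
GROUP BY CAST(status AS INTEGER)
```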
Use LIMIT with ORDER BY
ORDER BY requires that all rows be sent to a single worker, which then sorts them, so it can require a large amount of memory on that Presto worker. When you only need the top or bottom N records, add LIMIT, which reduces the sort cost and memory pressure.
[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100
[BAD]:  SELECT * FROM tbl ORDER BY time
Using approximate aggregate functions
Presto has several approximate aggregation functions that give you significant performance improvements at the cost of a small error. For example, by using the approx_distinct() function, you can get an approximation of COUNT(DISTINCT x) with a standard error of 2.3%. The following example gives an approximate count of the previous day’s unique users.
SELECT approx_distinct(user_id) FROM access WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d', 'PDT'), TD_SCHEDULED_TIME())
Aggregating a series of LIKE clauses in one single regexp_like clause
Presto’s query optimizer is unable to improve queries that contain many LIKE clauses, so query execution can be slower than expected. To improve performance, you can substitute a series of LIKE clauses chained with OR with a single regexp_like clause, which is Presto-native.
SELECT ... FROM access WHERE method LIKE '%GET%' OR method LIKE '%POST%' OR method LIKE '%PUT%' OR method LIKE '%DELETE%'
can be optimized by replacing the four LIKE clauses with a single regexp_like clause:
SELECT ... FROM access WHERE regexp_like(method, 'GET|POST|PUT|DELETE')
Specifying large tables first in a join clause
The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the partitions. This type of join works when your right-hand side table is small enough to fit within one node (usually less than 2GB). If you receive an ‘Exceeded max memory xxGB’ error, then the right-hand side table is too large. Presto does not perform automatic join-reordering, so make sure your large table precedes small tables in any join clause.
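For instance, with a hypothetical large access table and a small users lookup table, the large table should appear first so that only the small table is broadcast to the workers:

```sql
-- [GOOD]: the small table sits on the right-hand side,
-- so only its copy is broadcast to each worker.
SELECT a.path, u.name
FROM access a          -- large table first
JOIN users u           -- small table second
  ON a.user_id = u.id

-- [BAD]: reversing the order would broadcast the large table:
-- SELECT a.path, u.name FROM users u JOIN access a ON u.id = a.user_id
```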
Turning on the distributed hash join
If you still see the memory issue, try a distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join works even if the right-hand side table is large, but the performance might be slower because the join increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
DROP TABLE IF EXISTS my_result;
-- set session distributed_join = 'true'
CREATE TABLE my_result AS
SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
Even with the distributed join, query processing may be slow if the join key values in your data are skewed.
Improving Performance of Very Large Queries
You can parallelize the query result output process by using the CREATE TABLE AS SELECT statement. If you DROP the table before running the query, your performance is significantly better: the result output performance will be 5x faster than running SELECT *, because Treasure Data Presto skips the JSON output process and directly produces a 1-hour partitioned table.
Without using DROP TABLE, Presto uses JSON text to materialize query results. If the result table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result, so even when the query computation is almost finished, outputting the JSON results takes a long time.
To clean up the result table beforehand:
- Add a DROP TABLE statement at the top of your query.
- Use CREATE TABLE (table) AS SELECT …
For example, your query might look like this:
DROP TABLE IF EXISTS my_result;
CREATE TABLE my_result AS
SELECT * FROM my_table;