- Basic knowledge of Arm Treasure Data.
- Basic knowledge of Presto query engine.
Only specifying the columns you need
Actual data in Treasure Data is stored as a columnar storage format, which is optimized for the query using only specific columns. So restricting accessed columns can improve your query performance significantly. Please specify the needed columns instead of using a wildcard (*).
[GOOD]: SELECT time,user,host FROM tbl [BAD]: SELECT * FROM tbl
Leveraging Time-based partitioning
All imported data is automatically partitioned into hourly buckets, based on the ‘time’ field within each data record. By specifying the time range in your query, you can avoid reading unnecessary data and can thus speed up your query significantly.
1.) WHERE time <=> Integer
When the ‘time’ field within the WHERE clause is specified, the query parser will automatically detect which partition(s) should be processed. Please note that this auto detection will not work if you specify the time with
float instead of
[GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 [GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 + 3600 [GOOD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020 - 3600 [BAD]: SELECT field1, field2, field3 FROM tbl WHERE time > 13493930200 / 10 [BAD]: SELECT field1, field2, field3 FROM tbl WHERE time > 1349393020.00 [BAD]: SELECT field1, field2, field3 FROM tbl WHERE time BETWEEN 1349392000 AND 1349394000
An easier way to slice data is to use TD_TIME_RANGE UDF.
[GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01 PDT') [GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', NULL, 'PDT') [GOOD]: SELECT ... WHERE TD_TIME_RANGE(time, '2013-01-01', TD_TIME_ADD('2013-01-01', '1day', 'PDT'))
However, if you use division in TD_TIME_RANGE, time partition opimization doesn’t work. For instance, the following conditions disable optimization.
[BAD]: SELECT ... WHERE TD_TIME_RANGE(time, TD_SCHEDULED_TIME() / 86400 * 86400)) [BAD]: SELECT ... WHERE TD_TIME_RANGE(time, 1356998401 / 86400 * 86400))
Considering the cardinality within GROUP BY
There’s a probability where GROUP BY becomes a little bit faster, by carefully ordering a list of fields within GROUP BY in an order of high cardinality.
[GOOD]: SELECT GROUP BY uid, gender [BAD]: SELECT GROUP BY gender, uid
Another tips about GROUP BY is to use number instead of strings as GROUP BY column because numbers require less memory are faster to compare than strings.
Use LIMIT with ORDER BY
ORDER BY requires all rows are sent to a single worker and then sort them. It may often require a lot of memory on a Presto worker. When you want to look up the top or bottom N records, please use LIMIT which can reduce the sort cost and memory pressure.
[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100 [BAD]: SELECT * FROM tbl ORDER BY time
Using approximate aggregate functions
Presto has a couple of approximate aggregation functions, which will give you significant performance improvements with some errors. For example, by using
approx_distinct() function, you can get an approximation of
COUNT(DISTINCT x) with standard error of 2.3%. The example below gives an approximate count of the previous day’s unique users.
SELECT approx_distinct(user_id) FROM access WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d', 'PDT'), TD_SCHEDULED_TIME())
Aggregating a series of LIKE clauses in one single regexp_like clause
Presto’s query optimizer is unable to improve queries where many LIKE clauses are used. As a consequence the query execution can be slower than expected in this case.
To improve the performance, one can substitute a series of LIKE clauses chainied in an OR with a single regexp_like clause, which is Presto native.
SELECT ... FROM access WHERE method LIKE '%GET%' OR method LIKE '%POST%' OR method LIKE '%PUT%' OR method LIKE '%DELETE%'
can be optimized by replacing the 4 LIKE clauses with a single regexp_like clause:
SELECT ... FROM access WHERE regexp_like(method, 'GET|POST|PUT|DELETE')
Specifying large tables first in join clause
The default join algorithm of Presto is broadcast join, which partitions the left-hand side table of a join and sends (broadcasts) a copy of the entire right-hand side table to all of the worker nodes that have the paritions. This works when your right table is small enough to fit within one node (usually less than 2GB). If you observe ‘Exceeded max memory xxGB’ error, this usually means the right-hand side table is too large. Presto does not perform automatic join-reordering, so please make sure your large table preceeds small tables in any join clause.
Turning on the distributed hash join
If you stil see the memory issue, try distributed hash join. This algorithm partitions both the left and right tables using the hash values of the join keys. So the distributed join would work even if the right-hand side table is large, but the performance can be slower because it increases the number of network data transfers. To turn on the distributed join, embed the following session property as an SQL comment:
-- set session distributed_join = 'true' SELECT ... FROM large_table l, small_table s WHERE l.id = s.id
Note that however if there is a skew of the key values in your data, some partition needs to process larger data set compared to the others. This might slow down the query processing.
Use RANK function instead of ROW_NUMBER to get TOP N records
In some of cases, you would like to pull only latest/top N records. But,
row_number() is slower than equivalent
-- BAD QUERY SELECT checksum(rnk) FROM ( SELECT row_number() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk FROM lineitem ) t WHERE rnk = 1
Using RANK function would be better performance than ROW_NUMBER FUNCTION.
-- GOOD QUERY SELECT checksum(rnk) FROM ( SELECT rank() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk FROM lineitem ) t WHERE rnk = 1