TD Hive 2 Performance Tuning
See Hive Performance Tuning, which includes:
- Leveraging Time-based Partitioning
- WHERE time <=> Integer
- Set Custom Schema
- DISTRIBUTE BY…SORT BY v. ORDER BY
- Avoid “SELECT count(DISTINCT field) FROM tbl”
- Considering the Cardinality within GROUP BY
- Efficient Top-k Query Processing using each_top_k
- Exploding multiple arrays at the same time with TD_NUMERIC_RANGE and TD_ARRAY_INDEX
Rewrite queries for better performance
When you run queries against tables that hold terabytes of data, tuning those queries is an important way to keep your system performing well.
Sometimes a query uses more mappers than it needs and can be rewritten to reduce that number. Or, if possible, the query can be modified to take advantage of Tez or Spark.
hive.optimize.countdistinct is a parameter that determines whether Hive rewrites a count distinct query into two stages. The first stage uses multiple reducers, keyed on the count distinct column, and the second stage uses a single reducer with no key.
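The two-stage plan can also be written by hand, which may make it easier to see what the parameter does. The following is a minimal sketch, not a definitive recipe: the table `access_logs` and the column `user_id` are hypothetical, and whether your environment lets you set this parameter per session is an assumption.

```sql
-- Assumption: the session is allowed to SET this Hive parameter.
SET hive.optimize.countdistinct=true;

-- Original form: the distinct count is computed in one step.
SELECT count(DISTINCT user_id) FROM access_logs;

-- Hand-written equivalent of the two-stage plan.
-- Stage 1 (inner query): deduplicate user_id, which can be spread
--   across multiple reducers keyed on user_id.
-- Stage 2 (outer query): count the deduplicated rows, with no key.
SELECT count(1)
FROM (
  SELECT user_id
  FROM access_logs
  WHERE user_id IS NOT NULL   -- count(DISTINCT ...) ignores NULL values
  GROUP BY user_id
) t;
```

Both queries should return the same number. The manual form is mainly useful when the automatic rewrite is unavailable or disabled.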
Query hints can help with performance
Hints let you make decisions that the query engine usually makes for you. Because you know your data, you might know things that the optimizer does not. Hints provide a mechanism to direct the query engine to process a query according to the criteria that you specify.
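As an illustration, Hive accepts hints as a comment placed directly after SELECT. The sketch below uses two join hints from the Apache Hive documentation, MAPJOIN and STREAMTABLE. The tables `page_views` (assumed large) and `users` (assumed small) are hypothetical, and whether a given hint is honored depends on your Hive version and configuration. For example, Hive ignores the MAPJOIN hint when hive.ignore.mapjoin.hint is true.

```sql
-- Hypothetical tables: page_views is large, users is small.

-- MAPJOIN: ask the engine to load the small table (alias u) into memory
-- and perform the join on the map side, avoiding a reduce-side join.
SELECT /*+ MAPJOIN(u) */ pv.page_url, u.country
FROM page_views pv
JOIN users u ON pv.user_id = u.user_id;

-- STREAMTABLE: name the table (alias pv) to stream through the join
-- while the other table is buffered.
SELECT /*+ STREAMTABLE(pv) */ pv.page_url, u.country
FROM page_views pv
JOIN users u ON pv.user_id = u.user_id;
```

A hint only helps when what you know about the data is accurate. A MAPJOIN hint on a table that does not fit in memory can make the query slower or cause it to fail.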