Arm Treasure Data provides a SQL-like query interface called the Hive query language (HiveQL), powered by Apache Hive. HiveQL is the primary data processing method for Treasure Data. Treasure Data is a CDP (customer data platform) that allows users to collect, store, and analyze their data in the cloud. Treasure Data manages its own Hadoop cluster, which accepts queries from users and executes them using the Hadoop MapReduce framework; HiveQL is one of the languages it supports.
No Need to Worry about Schema Changes
Treasure Data supports HiveQL semantics, but unlike Apache Hive, we allow the user to set and modify the schema at any time. We do not require that a table schema be defined upfront.
Hive Example Query Catalog
If you’re looking for HiveQL templates, visit Treasure Data’s example query catalog page:
Hive SELECT Statement Syntax
Here’s the syntax of Hive’s SELECT statement.
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
SELECT is the projection operator in HiveQL. The key points are:
- SELECT scans the table specified by the FROM clause.
- WHERE gives the condition specifying which records to filter.
- GROUP BY gives a list of columns specifying how to aggregate the records.
- SORT BY specifies the columns and order for sorting.
- LIMIT specifies how many records to retrieve.
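As a sketch, these clauses can be combined in a single query. The `products` table and its `category` and `price` columns here are hypothetical:

```sql
-- Hypothetical example: count products per category priced over 100,
-- keep only categories with more than 5 such products,
-- and return the 10 largest groups.
SELECT category, COUNT(1) AS cnt
FROM products
WHERE price > 100
GROUP BY category
HAVING cnt > 5
ORDER BY cnt DESC
LIMIT 10;
```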
Computing with Columns
When selecting columns, you can manipulate column values using arithmetic operators or function calls. Math, date, and string functions are commonly used.
Here’s an example query that uses both operators and functions.
SELECT upper(name), sales_cost FROM products;
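Arithmetic operators can be mixed with functions in the same select list. A hedged sketch, where the 10% markup on the hypothetical `sales_cost` column is purely illustrative:

```sql
-- Hypothetical example: uppercase the name and apply a 10% markup
SELECT upper(name), sales_cost * 1.1 AS marked_up_cost
FROM products;
```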
WHERE Clauses
The WHERE clause filters the result set using predicate operators and logical operators. Functions can also be used to compute the condition.
Here’s an example query that uses a WHERE clause.
SELECT name FROM products WHERE name = 'stone of jordan';
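As noted above, a function can also compute the condition. A sketch using Hive's built-in `length()` function on the same hypothetical table:

```sql
-- Hypothetical example: filter rows by a computed condition
SELECT name FROM products
WHERE length(name) > 10;
```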
GROUP BY Clauses
The GROUP BY clause is frequently used with aggregate functions to group the result set by columns and apply aggregate functions over each group. Functions can also be used to compute the grouping key.
Here’s an example query that groups and counts by category.
SELECT category, count(1) FROM products GROUP BY category;
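To illustrate a computed grouping key, here is a sketch that groups by the first letter of the hypothetical `category` column using Hive's built-in `substr()` function:

```sql
-- Hypothetical example: group by a computed key
-- (the first letter of the category)
SELECT substr(category, 1, 1) AS initial, COUNT(1)
FROM products
GROUP BY substr(category, 1, 1);
```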
HAVING Clauses
The HAVING clause lets you filter the groups produced by GROUP BY by applying predicate operators to each group.
Here's an example query that groups and counts by category, then retrieves only the groups with counts greater than 10.
SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10;
Here are some basic examples. The underlying `www_access` table consists of three fields: `ip`, `url`, and `time`.
-- Number of Records
SELECT COUNT(1) FROM www_access;

-- Number of Unique IPs
SELECT COUNT(1) FROM (
  SELECT DISTINCT ip FROM www_access
) t;

-- Number of Unique IPs that Accessed the Top Page
SELECT COUNT(DISTINCT ip) FROM www_access
WHERE url = '/';

-- Number of Accesses per Unique IP
SELECT ip, COUNT(1) FROM www_access
GROUP BY ip LIMIT 30;

-- Unique IPs Sorted by Number of Accesses
SELECT ip, COUNT(1) AS cnt FROM www_access
GROUP BY ip ORDER BY cnt DESC LIMIT 30;

-- Number of Accesses After a Certain Time
SELECT COUNT(1) FROM www_access
WHERE TD_TIME_RANGE(time, "2011-08-19", NULL, "PDT");

-- Number of Accesses Each Day
SELECT
  TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT") AS day,
  COUNT(1) AS cnt
FROM www_access
GROUP BY TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT");
Note: the `time` column is a special column that is always present and stores the UNIX timestamp of the log.
INSERT Statement Syntax
Here’s the syntax of Hive’s INSERT statement.
-- Append new rows to tablename1
INSERT INTO TABLE tablename1 select_statement1 FROM from_statement;

-- Replace the contents of tablename1
INSERT OVERWRITE TABLE tablename1 select_statement1 FROM from_statement;

-- More complex example using a WITH clause
WITH tablename1 AS (select_statement1 FROM from_statement)
INSERT [OVERWRITE | INTO] TABLE tablename2 select_statement2 FROM tablename1;
- INSERT INTO will append to the table, keeping the existing data intact.
- If the record doesn’t include a time column, the time value is set using TD_SCHEDULED_TIME().
- If the record includes a time column, its value should be a UNIX timestamp.
- All INSERT INTO statements require the FROM clause.
- All column aliases used in an INSERT...SELECT statement should be valid SQL column names, to avoid failures when setting the schema.
- Treasure Data HiveQL does not support Hive Multi-Table Inserts.
- On tables NOT receiving streaming updates, INSERT OVERWRITE will delete any existing data in the table and write the new rows.
- INSERT OVERWRITE is not recommended for use on tables receiving streaming updates. INSERT OVERWRITE will not delete recently received streaming update rows or updates that arrive during the execution of INSERT OVERWRITE. If you need to perform INSERT OVERWRITE on a table that normally receives streaming updates, stop the streaming update before performing INSERT OVERWRITE.
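As a concrete sketch of the rules above (the `daily_summary` target table and its columns are hypothetical), this statement appends aggregated rows and includes an explicit `time` column carrying a UNIX timestamp; without it, TD_SCHEDULED_TIME() would supply the value:

```sql
-- Hypothetical example: append aggregated daily counts to a summary table
INSERT INTO TABLE daily_summary
SELECT
  TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT") AS day,
  COUNT(1) AS cnt,
  MIN(time) AS time
FROM www_access
GROUP BY TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT");
```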
Hive Reference Document Links
If you would like to experiment with the Hive SQL syntax, see our Hive Query Language Reference.
Other documents regarding HiveQL:
- Hive Built-in Operators
- Hive Built-in Functions
- Hive Built-in Aggregate Functions
- Hive Treasure Data UDFs
- Hive Treasure Data Known Limitations
Here are the official Hive tutorial and language manual: