Arm Treasure Data is an analytics infrastructure as a service. We provide a SQL-like query language interface called the Hive query language. This article covers the basics of the Hive query language
About Apache Hive
The Hive query language (HiveQL) is the primary data processing method for Treasure Data. HiveQL is powered by Apache Hive. Treasure Data is a cloud data platform that allows users to collect, store, and analyze their data on the cloud. Treasure Data manages its own Hadoop cluster, which accepts queries from users and executes them using the Hadoop MapReduce framework. HiveQL is one of the languages it supports.
No Need to Worry about Schema Changes
Treasure Data supports HiveQL semantics, but unlike Apache Hive, we allow the user to set and modify the schema at any time. We do not require table schema to be defined upfront. For more information, refer to this document.
Hive Example Query Catalog
If you’re looking for dozens of HiveQL templates, visit Treasure Data’s example query catalog page.
SELECT Statement Syntax
Here’s the syntax of Hive’s SELECT statement.
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING having_condition] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] [LIMIT number] ;
SELECT is the projection operator in HiveQL. The points are:
SELECTscans the table specified by the
WHEREgives the condition of what to filter
GROUP BYgives a list of columns which specify how to aggregate the records
SORT BYspecify the sort order and algorithm
LIMITspecifies how many # of records to retrieve
Computing with Columns
When you select the columns, you can manipulate column values using either arithmetic operators or function calls. Math, date, and string functions are popular.
Here’s an example query that uses both operators and functions.
SELECT upper(name), sales_cost FROM products;
WHERE clause is used to filter the result set by using predicate operators and logical operators. Functions can also be used to compute the condition.
Here’s an example query that uses a WHERE clause.
SELECT name FROM products WHERE name = 'stone of jordan';
GROUP BY Clauses
GROUP BY clause is frequently used with aggregate functions, to group the result set by columns and apply aggregate functions over each group. Functions can also be used to compute the grouping key.
Here’s an example query that groups and counts by category.
SELECT category, count(1) FROM products GROUP BY category;
HAVING clause lets you filter the groups produced by
GROUP BY, by applying predicate operators to each groups.
Here’s an example query that groups and counts by category, and then retrieves only counts > 10;
SELECT category, count(1) AS cnt FROM products GROUP BY category HAVING cnt > 10;
Here are some basic examples. The underlying table consists of three fields:
# Number of Records SELECT COUNT(1) FROM www_access; # Number of Unique IPs SELECT COUNT(1) FROM ( \ SELECT DISTINCT ip FROM www_access \ ) t; # Number of Unique IPs that Accessed the Top Page SELECT COUNT(distinct ip) FROM www_access \ WHERE url='/'; # Number of Accesses per Unique IP SELECT ip, COUNT(1) FROM www_access \ GROUP BY ip LIMIT 30; # Unique IPs Sorted by Number of Accesses SELECT ip, COUNT(1) AS cnt FROM www_access \ GROUP BY ip ORDER BY cnt DESC LIMIT 30; # Number of Accesses After a Certain Time SELECT COUNT(1) FROM www_access \ WHERE TD_TIME_RANGE(time, "2011-08-19", NULL, "PDT") # Number of Accesses Each Day SELECT \ TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT") AS day, \ COUNT(1) AS cnt \ FROM www_access \ GROUP BY TD_TIME_FORMAT(time, "yyyy-MM-dd", "PDT")
|The `time` column is a special column that is always present and stores the UNIX timestamp of the log.|
INSERT Statement Syntax
Here’s the syntax of Hive’s INSERT statement.
-- append new rows to tablename1 INSERT INTO TABLE tablename1 select_statement1 FROM from_statement; -- replace contents of tablename1 INSERT OVERWRITE TABLE tablename1 select_statement1 FROM from_statement; -- more complex example using WITH clause WITH tablename1 AS (select_statement1 FROM from_statement) INSERT [OVERWRITE/INTO] TABLE tablename2 select_statement2 FROM tablename1;
- INSERT INTO will append to the table, keeping the existing data intact.
- If record doesn’t include
timecolumn is imported TD_SCHEDULED_TIME().
- If record includes
timecolumn should be Unixtime.
- All INSERT INTO statements require the FROM clause.
|All column aliases used in INSERT..SELECT statement should use a valid SQL column name to avoid failures setting the schema.|
|TreasureData HiveQL does not support Hive Multi-Table Inserts. More about INSERT OVERWRITE:|
On tables NOT receiving streaming updates, INSERT OVERWRITE will delete any existing data in the table and write the new rows.
INSERT OVERWRITE is not recommended for use on tables receiving streaming updates. INSERT OVERWRITE will not delete recently received streaming update rows or updates that arrive during the execution of INSERT OVERWRITE. If you need to perform INSERT OVERWRITE on a table tbat normally receives streaming updates, stop the streaming update before performing INSERT OVERWRITE.
Hive Syntax Checker
If you would like to experiment with the Hive SQL syntax, we made our Hive Syntax linter publicly available here.
See Other Documents
Here is a list of documents regarding the HiveQL.
- Hive Built-in Operators
- Hive Built-in Functions
- Hive Built-in Aggregate Functions
- Hive Treasure Data UDFs
- Hive Treasure Data Known Limitations
Here are the official Hive tutorial and language manual.
- The official Hive tutorial covers the basics of HiveQL.
- The official Hive language manual covers all features of HiveQL.