Arm Treasure Data uses the same convention as Relational Database Management Systems (RDBMSs) for managing data sets.
Unlike traditional warehousing platforms, Treasure Data lets you store data first and define the schema later. You can change the schema at any time, at no cost.
Conventional warehousing platforms are schema-dependent and support an assumptive analytics model. In this model, the data elements expected to yield insights are defined in advance as part of the data store schema.
Performance considerations are also important in the initial design, and the analyst must know the underlying structure to ensure query performance. When new columns are added to a table, the schema must change.
Big data analysis, however, is largely non-assumptive. The analyst seeks hidden patterns, relationships, or events in the data that were not obvious at the outset. You can query the data wherever it is stored, without the burden of performance considerations, and exploration can create requirements for new records to support the analysis trail.
The TD Approach: Store-First, Schema-Later
Unlike traditional warehousing platforms, TD lets you assign a schema even after you import data to a table. This means that you can add or remove fields at any time.
This approach is much more flexible, and schema changes no longer take days of work.
Prerequisites
- Basic knowledge of Treasure Data, including the TD toolbelt.
- A table with some data. See Running a query and downloading results.
Understanding the TD Default Schema
When a table is created in a TD database, it has two fields:
- time: The time that each entry was generated, in int64 UNIX time
- v: Each log entry, stored as map<string, string>
When you look up the value of a database entry, address the field using the format v['field1'].
Defining a custom schema is strongly recommended.
For testing purposes, you might want to identify the various data types used in your data. For example, use the TD toolbelt to run the following query:
td query -w -d sample_datasets "SELECT v['user'], COUNT(1) AS cnt FROM www_access WHERE v['action']='login' GROUP BY v['user'] ORDER BY cnt DESC"
Defining a Custom Schema
Typically, the default schema defined by TD is acceptable, but a custom schema can make queries shorter and greatly improve performance.
To define a custom schema for a table:
- Optionally, create the testdb database using the following command:
td db:create testdb
- Optionally, create the www_access table using the following command:
td table:create testdb www_access
- Use the td schema:set command. The syntax of the command is:
td schema:set <database> <table> <column_name>:<type>...
Where <column_name> consists of lowercase letters, numbers, and "_" only.
Where <type> can be one of the following: int, long, double, float, string, array<T>.
For this example, the schema would be added as follows:
td schema:set testdb www_access action:string labels:'array<string>' user:int
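The naming and type rules for td schema:set can be checked locally before the command is run. The following is a minimal sketch, not part of the TD toolbelt: the functions valid_scalar_type and valid_column_spec and the tags column are hypothetical names introduced for illustration, and the sketch assumes that T in array<T> is one of the scalar types listed above (nested arrays are rejected).

```shell
#!/bin/sh
# Hypothetical pre-check for <column_name>:<type> pairs (not a td command).
LC_ALL=C; export LC_ALL   # keep [a-z] limited to ASCII lowercase letters

# Succeeds if $1 is one of the scalar types listed for td schema:set.
valid_scalar_type() {
  case "$1" in
    int|long|double|float|string) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if $1 looks like <column_name>:<type>.
valid_column_spec() {
  spec=$1
  case "$spec" in
    *:*) ;;              # must contain a ":" separator
    *) return 1 ;;
  esac
  name=${spec%%:*}
  type=${spec#*:}
  # Column name: non-empty; lowercase letters, digits, and "_" only.
  case "$name" in
    ''|*[!a-z0-9_]*) return 1 ;;
  esac
  case "$type" in
    'array<'*'>')
      # Assumption: the element type T must be a scalar type.
      inner=${type#"array<"}
      inner=${inner%">"}
      valid_scalar_type "$inner" ;;
    *) valid_scalar_type "$type" ;;
  esac
}

# Usage: report any pair that breaks the rules.
for spec in action:string user:int "tags:array<string>" "Bad-Name:int"; do
  valid_column_spec "$spec" || echo "invalid column spec: $spec" >&2
done
```

Checking locally only catches formatting mistakes; td schema:set remains the authority on what is accepted.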
You can now query this table using the defined column names:
td query -w -d testdb "SELECT user, COUNT(1) AS cnt FROM www_access WHERE action='login' GROUP BY user ORDER BY cnt DESC"
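To see how a custom schema shortens queries, the sketch below rewrites default-schema references of the form v['field'] into plain column names. The helper to_custom_schema_query is a hypothetical name, not a td command, and the sketch assumes field names contain only lowercase letters, digits, and "_". It performs a textual rewrite only and does not parse SQL.

```shell
#!/bin/sh
# Hypothetical helper: turn v['field'] references into plain column names.
LC_ALL=C; export LC_ALL   # keep [a-z] limited to ASCII lowercase letters

to_custom_schema_query() {
  # Replace each v['name'] with name; all other text is left untouched.
  printf '%s\n' "$1" | sed "s/v\['\([a-z0-9_][a-z0-9_]*\)']/\1/g"
}

# Usage: convert the default-schema query into its custom-schema form.
old="SELECT v['user'], COUNT(1) AS cnt FROM www_access WHERE v['action']='login' GROUP BY v['user'] ORDER BY cnt DESC"
to_custom_schema_query "$old"
```

Applied to the default-schema query shown earlier, this prints the same query text that is passed to td query above. Because values in the default schema are strings while custom columns can be typed, a rewritten query may still need its comparisons adjusted by hand.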
Treasure Data, Presto and Hive Schema Relation
| Treasure Data | Presto | Hive |
|---|---|---|
| Convert to string or int | boolean | boolean |
| string | varchar | string or varchar |
| string or convert to long | date | string |
| string or convert to long | timestamp | timestamp |
You can refer to the open source documentation as well: