This article explains how to merge data from and to Treasure Data tables.
- Basic knowledge of Treasure Data, including the toolbelt
- Basic understanding of Job Result Output to Treasure Data
- Merging from one table into an existing table
- Merging from two tables into a third table
consist of querying the data from one or more source tables and writing the result to a new or existing table using Job Result Output to Treasure Data .
Merging One Table into Another Existing Table
In the following scenario, both the source and the destination table exist and contain data. Merging consists of querying the data from the source table and writing the result into the destination table using the Treasure Data Data Exchange connector.
Let’s assume the source table is called src_dbs.src_tbl and has a schema:
$ td schema:show src_dbs src_tbl src_dbs.src_tbl ( common_col:string src_col_1:string src_col_2:long )
and the destination table dst_dbs.dst_tbl has a schema:
$ td schema:show dst_dbs dst_tbl dst_dbs.dst_tbl ( common_col:string dst_col_1:string dst_col_2:long )
The data belonging to columns having the same name in the source and destination table will be merged – this includes the time column or its equivalent. Source table columns that don’t exist in the destination table are instead just copied over.
To merge the tables run this command in the CLI:
$ td query -d src_dbs -w -r 'td://@/dst_dbs/dst_tbl' \ 'SELECT `time`, src_col_1, src_col_2, common_col FROM src_tbl'
|We select the time column is used in order to preserve the original timestamps when writing the result to the destination table. If the time column is omitted, the time at which the query is ran is used instead.|
By using td://@/dst_dbs/dst_tbl as a job result output target, the result is written in append mode, appending the result of this query to the data already in the destination table dst_dbs.dst_tbl. After the query completes, the schema of the destination table is updated to include the columns in the source table’s result (that is, those columns indicated in the select clause of the query) that did not exist before; the new schema will be:
$ td schema:show dst_dbs dst_tbl dst_dbs.dst_tbl ( common_col:string dst_col_1:string dst_col_2:long src_col_1:string src_col_2:long )
Clearly data for both time and common_col was merged in the process. If you want to avoid merging common_col in the destination table with common_col in the source table, perhaps because the two columns contain data with different meanings that should not mixed, an easy way to do so is to rename the source column in the ‘SELECT’ clause with the old_name AS new_name diction:
$ td query ... 'SELECT `time`, src_col_1, src_col_2, common_col AS src_common_col FROM src_tbl'
Merging Two Tables Into A Third Table
In the following scenario, you merge data from two tables. Whether writing into a new or existing table, the approach is very similar. The schema caveats discussed previously apply here as well.
Let’s assume the two source tables are called src_dbs_1.src_tbl_1 and src_dbs_2.src_tbl_2 and have a schema:
$ td schema:show src_dbs_1 src_tbl_1 src_dbs_1.src_tbl_1 ( common_col:string src_col_1_1:string src_col_1_2:long ) $ td schema:show src_dbs_2 src_tbl_2 src_dbs_2.src_tbl_2 ( common_col:string src_col_2_1:string src_col_2_2:long )
For simplicity let’s assume the destination database dst_dbs exists but the destination table dst_dbs.dst_tbl does not (for schema update simplification).
This query merges the data from the two tables into the new destination table, which is created automatically when the result of the query is written to the destination.
$ td query -d src_dbs_1 -w -r 'td://@/dst_dbs/dst_tbl' \ 'SELECT \ * \ FROM ( \ SELECT \ tbl_1_alias.time, \ tbl_1_alias.common_col, \ tbl_1_alias.src_col_1_1, \ tbl_1_alias.src_col_1_2, \ NULL as src_col_2_1, \ NULL as src_col_2_2 \ FROM \ src_dbs_1.src_tbl_1 tbl_1_alias \ UNION ALL \ SELECT \ tbl_2_alias.time, \ tbl_2_alias.common_col, \ NULL as src_col_1_1, \ NULL as src_col_1_2 \ tbl_2_alias.src_col_2_1, \ tbl_2_alias.src_col_2_2 \ FROM \ src_dbs_2.src_tbl_2 tbl_2_alias \ ) tbl_1_union_tbl_2'
Note the following details:
- We use an UNION ALL to merge the content of the columns with identical names in the two source tables (common_col in this example) without elimination or de-duplication – UNION ALL also preserves the original timestamps of the records from both tables by including them in the two inner select clauses for each respective tables. Therefore, even if there were no columns to be merged, UNION ALL is still needed to merge the time column.
- In an UNION ALL both sides of the union must contain the same number, name, and type of columns. Therefore, we need to specify the second source table’s columns missing in the first source table as NULL as tbl_2_col_X to act as column placeholders and vice versa. This process can very easily become complicated if the tables contain many unrelated columns;
- If the two source tables reside in the same database, one can drop the database specified in the two (or more) inner FROM clauses;
- Both source tables need to be aliased with a ‘src_dbs_2.src_tbl_2 tbl_2_alias’ syntax and the tbl_2_alias is used to namespace columns in either table;
- The last alias at the end and outside of the round parenthesis is an arbitrary name, required by syntax.
If the destination table existed and contained data, the query can get a bit more complicated. Special care must be taken to make sure no unwanted merging or overwriting of columns happens in the process. The actual query can vary dramatically, depending on the specific use case.