You can write job results to your existing Amazon Redshift cluster.
For sample workflows on Amazon Redshift for Export, view Treasure Boxes.
- Basic knowledge of Arm Treasure Data, including the TD Toolbelt
- An Amazon Redshift cluster setup and running – either a single or multi node cluster
- Query only (or better) privileges to the Treasure Data table to be queried
If a column becomes NULL during conversion, and all records are rejected.
Result Output to Redshift can export data to several regions. Current supported regions are the followings:
- us-east-1 (US Standard)
- us-west-2 (Oregon)
- eu-west-1 (Ireland)
- ap-northeast-1 (Tokyo)
- us-west-1 (N. California)
- ap-southeast-1 (Singapore)
- ap-southeast-2 (Sydney)
- sa-east-1 (São Paulo)
Below regions are not supported:
- us-east-2 (Ohio)
- ap-south-1 (Mumbai)
- ap-northeast-2 (Seoul)
- ca-central-1 (Central)
- eu-central-1 (Frankfurt)
- eu-west-2 (London)
If you have other regions for which you want support, Contact Treasure Data support.
A front-end application streams data to be collected in Treasure Data through Treasure Agent. Treasure Data periodically runs jobs on the data, then writes the job results to your Redshift cluster.
Typically, using an architecture that enables data analysts, well versed in using Redshift, allows them to focus on queries and visualizations rather than how to get the data uploaded.
You can limit the access to your database by using a list of static IPs. Contact firstname.lastname@example.org if you want to use static IP addresses.
Amazon Redshift Configuration
Amazon Redshift can be configured in:
- single node mode
- multi node mode
The multi node configuration provides more query computation power by means of parallelization of the query execution on the available nodes. Multi node configuration is commonly referred to as cluster configuration.
Result Output URL
The result output target is represented by URL in the following format:
- redshift is identified for result output to Redshift.
- username and password are the credential to the Amazon Redshift instance. These credentials are specified when first creating the Redshift cluster and they are different from the S3 public/private access keys.
- hostname can be retrieved from the Redshift configuration page. Typically, the format is:
The name is the one provided for the cluster. The instance_id is auto generated by Amazon Redshift when the cluster is created. The region is the Amazon availability zone of choice. If you are using an IP address instead of hostname, you must set
- port is the port number through which the Redshift server is accessible. “:” is optional and assumed to be 5439 by default. It may vary for multi node configurations. The actual value can be retrieved from the Redshift cluster detail page.
- database is the name of the database specified at the creation of the Redshift cluster. It can be retrieved from the Redshift cluster detail page.
- table is the name of a table within the above mentioned database. It may not exist at the moment the query output is executed. If the table does not exist, a table with the specified name is created for the user.
Result output to Redshift supports various options that can be specified as optional URL parameters. The options are compatible with each other and can be combined. Where applicable, the default behavior is indicated.
ssl option determines whether to use SSL or not for connecting to Redshift.
Use SSL from Treasure Data to Redshift connection.
Do not use SSL from Treasure Data to Redshift.
Controls the schema the target table is located. If not specified, a default schema is used. The default schema depends on your search_path setting, but it is usually public.
Specify the region where your Redshift instance is located. This option is required if your hostname does not contain the region name.
Controls the various ways of modifying the database data. The 4 modes are:
The append mode is the default which is used when no mode option is provided in the URL. In this mode, the query results are appended to the table. If the table does not exist, it is created.
The append mode is not atomic because it does not use a temporary table to store the incoming data. Instead, the append mode adds the records directly to the destination table. This is done for efficiency purposes, but might find the destination table in an intermediate state if the write fails during the result output process.
If you want the appending action to become atomic, you can use an additional minor option:
inplace=false option ensures that a temporary table is used to store the incoming data, but has some negative effects on efficiency. The inplace option defaults to true.
Append mode also does not modify the schema of existing tables. If the source table columns' schema mismatches (column names, types, or both) that of the destination table in Redshift, then the mismatching columns are dropped from the result output write. However, if the destination table does not exists, the table will be created with matching schema to the source table.
mode=append is the default behavior, these two URLs are equivalent:
The replace mode consists of replacing the entire content of an existing table with the result output of the query. If the table does not exist, a new table is created.
For fastest results, we recommend that you have only one **write** connection open to your RDS database table at a time.
The replace mode achieves atomicity (so that a consumer of the table always has consistent data) by performing the following three steps in a single transaction:
- Create a temporary table
- Write to the temporary table
- Replace the existing table with the temporary table using ALTER TABLE RENAME
Truncate mode retains the indexes of the table. With the truncate mode, the system first truncates the existing table to remove all records, then inserts the query results into the existing table without modifying the table’s schema.
If the result output table contains columns whose name or type (or both) mismatches from the schema of the destination table, then the columns are dropped, and they will not get written into the destination table. This is not a problem when the destination table does not exist: a new table will be created with schema matching that of the result output table.
The truncate mode uses a temporary table, hence achieves atomicity of write.
In the update mode, a row is inserted unless it would cause a duplicate value in the columns specified in the unique parameter. In such a case, an update is performed instead.
The unique parameter is required when using the update mode. If the table does not exist, it will be created.
This mode is atomic because it uses a temporary table to store the incoming date before attempting to insert the data in the actual destination table.
redshift://...?mode=update&unique=col1 # single unique column redshift://...?mode=update&unique=[col1,col2] # multiple unique columns
These are the options that can be used with
Append Inplace Mode
inplace minor option can control the atomicity of the action. When
true, no temporary table is used and the operation is not guaranteed to be atomic. When the option is
false, a temporary table is used. The default value is true.
Serial Copy Option
The serial_copy option determines whether to upload all the files in order one by one, to avoid some deadlocks that could happen when uploading files parallel.
Use SSL from Treasure Data to Redshift connection.
Exporting your Data
You must use the TD Toolbelt to run the query that returns the data that you want to have exported to Amazon Redshift.
One Time Export
To output the result of a single query to Amazon Redshift
- Add the
--resultoption to the
td querycommand. For example:
$ td query -w -d testdb \ --result 'redshift://username:email@example.com/database/table?mode=replace' \ "SELECT code, COUNT(1) FROM www_access GROUP BY code"
After the job is finished, the results are written into your instance.
To create a scheduled query whose output is systematically written to Redshift
- Use the td sched:create command to schedule the query.
- Add the
--resultoption. For example:
$ td sched:create hourly_count_example "0 * * * *" -d testdb \ --result 'redshift://username:firstname.lastname@example.org/database/table?mode=replace' \ "SELECT COUNT(*) FROM www_access"
If your Redshift doesn’t get any data, even when the job for Result Output to Redshift was successful
Result Output to Redshift tries to convert column type if the destination table already exists. If conversion fails, column becomes NULL, and all records are rejected if the column on Redshift is NOT NULL field.