This article explains how to write job results to your existing PostgreSQL instance.
- Basic knowledge of Arm Treasure Data, including the toolbelt.
- A PostgreSQL instance.
- Treasure Data must have SELECT, INSERT, DELETE, CREATE, DROP and ALTER privileges.
A front-end application streams data to be collected in Treasure Data via Treasure Agent. Treasure Data periodically runs jobs on the data, then writes the job results to your Postgres tables.
Shown above is a fairly common architecture. Here are a few examples.
Example 1: Ranking: What are the “Top N of X?”
Every social/mobile application calculates the “top N of X” (ex: top 5 movies watched today). Treasure Data already handles the raw data warehousing; the “write-to-PostgreSQL” feature enables Treasure Data to find the “top N” data as well.
Example 2: Dashboard Application
If you’re a data scientist, you need to keep track of a range of metrics every hour/day/month and make them accessible via visualizations. Using this “write-to-PostgreSQL” feature, you can streamline the process and focus on your queries and your visualizations of the query results.
|You can limit the access to your database by using a list of static IPs. Please contact firstname.lastname@example.org if you need it.|
Result Output URL Format
The result output target is represented by URL with the following format:
- postgresql is identified for result output to Postgres;
- username and password are the credential to the Postgres server;
- hostname is the host name of the Postgres server;
- port is the port number through which the Postgres server is accessible. “:” is optional and assumed to be 5432 by default;
- database is the name of the destination database;
- table is the name of a table within the above mentioned database. It may not exist at the moment the query output is execute, in which case a table with the specified name will be created for the user.
Result output to Postgres supports various options that can be specified as optional URL parameters. The options are compatible with each other and can be combined. Where applicable, the default behavior is indicated.
ssl option determines whether to use SSL or not for connecting to the Postgres server.
Use SSL from Treasure Data to the Postgres server connection. The Postgres server must be configured to accept SSL connection.
Do not use SSL from Treasure Data to the Postgres server connection.
Controls the schema the target table is located. If not specified default schema is to be used. The default schema depends on user’s “search_path” setting but it is usually “public”.
Controls the various ways of modifying the database data. All 4 supported modes are atomic because they use a temporary table to store the incoming data before attempting to modify the destination table:
The append mode is the default which is used when no mode option is provided in the URL. In this mode the query results are appended to the table. If the table does not exist, it will be created.
mode=append is the default behavior, these two URLs are equivalent:
The replace mode consists of replacing the entire content of an existing table with the result output of the query. If the table does not exist yet, a new table will be created. The replace mode achieves atomicity (so that a consumer of the table always has consistent data) by performing the following three steps in a single transaction:
- Create a temporary table;
- Write to the temporary table;
- Replace the existing table with the temporary table using ALTER TABLE RENAME.
With the truncate mode the system first truncates the existing table, then inserts the query results. If the table does not exist yet, a new table will be created.
|Unlike replace, the truncate mode retains the indexes of the table.|
In the update mode a row is inserted unless it would cause a duplicate value in the columns specified in the “unique” parameter: in such case an update is performed instead. Please note that the “unique” parameter is required when using the update mode.
postgresql://...?mode=update&unique=col1 # single unique column postgresql://...?mode=update&unique=[col1,col2] # multiple unique columns
The method option controls how the data is written into the Postgres table.
The default method is insert; it is also the recommended method for most situations.
With the insert method, data is written to Postgres using ‘INSERT’ statements. This is the most reliable and compatible method and it is recommended for most situations.
method=insert is the default behavior, these two URLs are equivalent:
When the copy method is used the data is first stored in a temporary file on the server, then written to Postgres using a COPY transaction. This method is faster than INSERT, so it is useful when handling a large amount of data.
To output the result of a single query to a Postgres server add the
--result option to the
td query command. After the job is finished, the results will be written into your database:
$ td query -w -d testdb \ --result 'postgresql://user:password@host/database/table' \ "SELECT code, COUNT(1) FROM www_access GROUP BY code"
To create a scheduled query whose output is systematically written to Postgres add the
--result option when creating the schedule through
td sched:create command:
$ td sched:create hourly_count_example "0 * * * *" -d testdb \ --result 'postgresql://user:password@host/database/table' \ "SELECT COUNT(*) FROM www_access"