Arm Treasure Data's scheduler feature supports periodic query execution. We take great care to distribute and operate our scheduler so as to achieve high availability. By using this feature, you no longer need a cron daemon in your local data center.
- Basic knowledge of Treasure Data
- A table with some data
Create the Schedule (Web Console)
You can create a new schedule using the Web Console > Queries. Select the “New Query” pane, and set a schedule for the query.
Predefined Cron schedules are as follows:
| Schedule | Description | Cron equivalent |
|---|---|---|
| @hourly | Run once an hour | `0 * * * *` |
| @daily | Run once a day at midnight | `0 0 * * *` |
| @monthly | Run once a month at midnight on the first day of the month | `0 0 1 * *` |
For any other scheduling definition, select the ‘Other cron…’ option. The cron specification format is based on the cron-spec gem. The required fields are:
```
* * * * *
- - - - -
| | | | |
| | | | +----- day of week (0 - 6) (Sunday=0)
| | | +------- month (1 - 12)
| | +--------- day of month (1 - 31)
| +----------- hour (0 - 23)
+------------- min (0 - 59)
```
The following named entries can be used:
- Day of Week: sun, mon, tue, wed, thu, fri, sat
- Month: jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec
A single space is required between each field. The values for each field can be composed of:
- a single value, within the limits displayed above for each field.
- a wildcard ‘*’ to indicate no restriction based on the field. E.g. ‘0 0 1 * *’ configures the schedule to run at midnight (00:00) on the first day of each month.
- a range ‘2-5’, indicating the range of accepted values for the field. E.g. ‘0 0 1-10 * *’ configures the schedule to run at midnight (00:00) on the first 10 days of each month.
- a list of comma-separated values ‘2,3,4,5’, indicating the list of accepted values for the field. E.g. ‘0 0 1,11,21 * *’ configures the schedule to run at midnight (00:00) every 1st, 11th, and 21st day of each month.
- a periodicity indicator ‘*/5’ to express how often based on the field’s valid range of values a schedule is allowed to run. E.g. ‘30 */2 1 * *’ configures the schedule to run on the 1st of every month, every 2 hours starting at 00:30. ‘0 0 */5 * *’ configures the schedule to run at midnight (00:00) every 5 days starting on the 5th of each month.
- a comma-separated list combining any of the above forms except the ‘*’ wildcard, e.g. ‘2,*/5,8-10’. E.g. ‘0 0 5,*/10,25 * *’ configures the schedule to run at midnight (00:00) on the 5th, 10th, 20th, and 25th day of each month.
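To illustrate how these field expressions expand into concrete values, here is a minimal Python sketch. It is not Treasure Data's actual parser, and the anchoring of ‘*/n’ steps (every value divisible by n) is an assumption chosen to match the examples above:

```python
def expand_field(expr, lo, hi):
    """Expand a cron field expression into the sorted list of matching values.

    Supports single values, ranges ('2-5'), steps ('*/5'), the '*'
    wildcard, and comma-separated lists of these. Here '*/n' is taken
    to mean every value in [lo, hi] divisible by n; real cron
    implementations vary on this point.
    """
    values = set()
    for part in expr.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif part.startswith("*/"):
            step = int(part[2:])
            values.update(v for v in range(lo, hi + 1) if v % step == 0)
        elif "-" in part:
            start, end = map(int, part.split("-"))
            values.update(range(start, end + 1))
        else:
            values.add(int(part))
    return sorted(values)

# Day-of-month field '1,11,21' from the list example above
print(expand_field("1,11,21", 1, 31))   # [1, 11, 21]
# Day-of-month field '*/5': every 5 days starting on the 5th
print(expand_field("*/5", 1, 31))       # [5, 10, 15, 20, 25, 30]
```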
When two fields provide conflicting schedule specifications, the specification that fires more often is followed and the other is ignored. For example, if the cron schedule is ‘0 0 1 * 1’, the ‘day of month’ and ‘day of week’ specifications are discordant: the former requires a run at midnight (00:00) on the first day of each month, while the latter requires a run at midnight (00:00) every Monday. Since the weekly specification fires more often, it is the one followed. The same concept applies to all other fields.
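To see why the day-of-week specification wins in that example, you can compare how often each field alone would fire over a year using only the Python standard library (a quick check, not part of the scheduler; note cron uses Sunday=0 while Python's `weekday()` uses Monday=0):

```python
from datetime import date, timedelta

def days_matching(year, predicate):
    """Count the days in a calendar year satisfying a predicate."""
    d = date(year, 1, 1)
    n = 0
    while d.year == year:
        if predicate(d):
            n += 1
        d += timedelta(days=1)
    return n

# 'day of month' = 1: fires on the 1st of each month -> 12 times a year
first_of_month = days_matching(2024, lambda d: d.day == 1)
# 'day of week' = 1 (cron Monday) -> Python weekday() == 0
mondays = days_matching(2024, lambda d: d.weekday() == 0)

print(first_of_month, mondays)  # 12 53 — the weekly spec fires more often
```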
Example: Daily KPIs
A common pattern is to calculate fixed KPIs or metrics at a regular interval.
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd') AS day, COUNT(1) AS cnt
FROM www_access
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME())
GROUP BY TD_TIME_FORMAT(time, 'yyyy-MM-dd')
We can schedule the above query to run on a daily basis by setting the following parameters:
- Recurring?: @daily
- Delay (seconds): 3600
The example above aggregates the daily page views from an access log. It makes use of several common UDFs, available for both Presto and Hive, to set the proper time range for the aggregation. TD_SCHEDULED_TIME() returns the time for which the job was scheduled, even if the job actually starts later. In this case, TD_SCHEDULED_TIME() returns midnight (00:00) of the scheduled day.
By setting the delay parameter to an hour (3600 seconds), the job launches at 1 AM each day – thus allowing some additional time (an hour in this case) for data to be imported before the job is run.
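The relationship between the cron slot, the delay, and the query's aggregation window can be sketched with Python's `datetime` (the date is illustrative; this models the arithmetic, not the scheduler itself):

```python
from datetime import datetime, timedelta

scheduled_time = datetime(2024, 6, 1, 0, 0)        # the @daily cron slot (midnight)
delay = timedelta(seconds=3600)                    # "Delay (seconds): 3600"

launch_time = scheduled_time + delay               # when the job actually starts
window_start = scheduled_time - timedelta(days=1)  # TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d')
window_end = scheduled_time                        # TD_SCHEDULED_TIME()

print(launch_time)                   # 2024-06-01 01:00:00 — 1 AM launch
print(window_start, "->", window_end)  # the previous full day is aggregated
```

Even though the job launches at 1 AM, TD_SCHEDULED_TIME() still evaluates to the midnight slot, so the aggregation window stays aligned to whole days.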
See Job Result Output for details on how to write the result to an end-system of your choice to track KPIs.
Example: List of Daily Active Users
Another common pattern is to periodically export results to a Treasure Data table, which in turn can be used by other jobs.
SELECT user_id
FROM www_access
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1d'), TD_SCHEDULED_TIME())
GROUP BY user_id
Like the previous example, we can schedule the above query to run on a daily basis by setting the following parameters:
- Recurring?: @daily
- Delay (seconds): 3600
The preceding example aggregates the list of daily active users from an access log.
You can write this list to another Treasure Data table by setting the following parameters:
- Export Result To: Treasure Data
- Database: YOUR_DATABASE
- Table: YOUR_TABLE
- Mode: Replace
Writing the result into another table is atomic, so you can reference this table from other queries at any time.
Example: Data Mart Generation
One more common pattern is to periodically summarize data from logs, and build a datamart by pushing results to an RDBMS.
SELECT user, code, method, path, agent, host, AVG(size) AS avg_size
FROM www_access
WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1h'), TD_SCHEDULED_TIME())
GROUP BY user, code, method, path, agent, host
To schedule this job, set the following parameters:
- Recurring?: @hourly
- Delay (seconds): 600
The preceding example aggregates web requests from an access log by user, code, method, path, agent, and host, computing the average response size, on an hourly basis.
See Using TD Data Exchange for information on how to write the result into another RDBMS.
After you push the query results to an external RDBMS, you can slice and dice the results using the BI tool(s) of your choice.
Update the Schedule
Most of the settings of scheduled queries can be updated or reconfigured at any time, including the cron schedule, delay, name, query string, and so on. In contrast, the ‘next scheduled run time’ (shown as ‘Next run’ in the Web Console) is an inferred value that is automatically updated based on the current time, the cron schedule, and the delay setting.
When a scheduled query is updated, the updated settings apply immediately, starting with the next scheduled run, which occurs at ‘next scheduled run time + delay’ (whether the delay is positive, zero, or negative).
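How the ‘next scheduled run time’ is inferred for the @daily case can be sketched as follows; this is a simplified model for one schedule only, not the scheduler's actual algorithm:

```python
from datetime import datetime, timedelta

def next_daily_run(now, delay_seconds=0):
    """Infer the next @daily run: the upcoming midnight cron slot,
    shifted by the configured delay (positive, zero, or negative)."""
    next_slot = (now.replace(hour=0, minute=0, second=0, microsecond=0)
                 + timedelta(days=1))
    return next_slot + timedelta(seconds=delay_seconds)

# At 10:30 AM on June 1 with a 3600-second delay, the next run
# is inferred to start at 1 AM on June 2.
print(next_daily_run(datetime(2024, 6, 1, 10, 30), 3600))
# 2024-06-02 01:00:00
```

Because the value is recomputed from the current time and the current settings, editing the cron schedule or the delay immediately moves the ‘Next run’ shown in the Web Console.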