Arm Treasure Data has a scheduler feature that supports periodic query execution.
Prerequisites
- Basic knowledge of Treasure Data, including TD toolbelt
- A table with some data.
TD Toolbelt Create Schedule Syntax
A new schedule can be created using the td sched:create command. The name of the schedule, cron-style schedule, query, and the database name are required.
$ td sched:create <name> <cron> [sql]
options:
-d, --database DB_NAME use the database (required)
-t, --timezone TZ name of the timezone. Only extended timezones are supported.
-D, --delay SECONDS delay time of the schedule
-r, --result RESULT_URL write result to the URL (see also result:create subcommand)
-u, --user NAME set user name for the result URL
-p, --password ask password for the result URL
-P, --priority PRIORITY set priority
-q, --query PATH use file instead of inline query
-R, --retry COUNT automatic retrying count
-T, --type TYPE set query type (hive)
For example:
Setting the Priority of Scheduled Jobs using TD Toolbelt
To change the priority of a job, use:
td sched:update <job_name> -P <#>
Priority | syntax option value |
very high | -P 2 |
normal | -P 0 |
very low | -P -2 |
For example:
td sched:update 111_beta_query_01 -P 2
Example: Periodic Aggregation
A common pattern is to periodically summarize data from a main table into another table. The example aggregates web request results from an access log on an hourly basis. It makes use of several common functions for Presto (or for Hive) to set the proper time range for the aggregation.
SELECT user, code, method, path, agent, host, avg(size) FROM www_access GROUP BY user, code, method, path, agent, host WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1h'), TD_SCHEDULED_TIME())
To finish setting up the query to run on an hourly basis, create a scheduled query :
$ td sched:create \
hourly_agg \ "@hourly" \ -d testdb \ -D 1800 \ "SELECT user, code, method, path, agent, host, avg(size) FROM www_access GROUP BY user, code, method, path, agent, host WHERE TD_TIME_RANGE(time, TD_TIME_ADD(TD_SCHEDULED_TIME(), '-1h'), TD_SCHEDULED_TIME())"
In the preceding example, @hourly
is the same as the cron 0 * * * *
. The -D
parameter allows you specify a delay time in seconds. In the example, a 1800 seconds (or 30 minutes delay) is set. The job will execute 30 minutes after each hour.
Viewing Schedules using TD Toolbelt
To see the list of existing scheduled entries, use td sched:list
.
Viewing the History of a Job
td sched:history
shows the job history of a schedule entry. To investigate the results of each individual job, use td job <jobid>
.
To get a list of jobs that run according to a defined schedule, use:
td sched:list
To view the history of one of the jobs with a defined schedule, use:
td sched:history <name> [max]
Example:
Update the Schedule
Most of the settings for scheduled queries can be updated or reconfigured at any time. The ‘next scheduled run time’ (also referred to as ‘Next’ or ‘Next schedule’) is an inferred configuration that is auto updated based on the current time of your browser's timezone, cron schedule, and delay setting.
When a scheduled query is updated, the updated settings will apply immediately to the next scheduled run, occurring at ‘next scheduled run time + delay’ (with delay >0, <0 or 0).
td sched:update <name>
example:
$ td sched:update sched1 -s "0 */2 * * *" -d my_db -t "Asia/Tokyo" -D 3600
options:
-n, --newname NAME change the schedule's name
-s, --schedule CRON change the schedule
-q, --query SQL change the query
-d, --database DB_NAME change the database
-r, --result RESULT_URL change the result target (see also result:create subcommand)
-t, --timezone TZ name of the timezone. Only extended timezones are supported.
-D, --delay SECONDS change the delay time of the schedule
-P, --priority PRIORITY set priority
-R, --retry COUNT automatic retrying count
-T, --type TYPE set query type (hive)
Unschedule the Job
To unschedule a specific job, use:
"" instead of "*****"
For example:
$ td sched:update sched1 -s "" -d my_db -t "Asia/Tokyo" -D 3600
or
$ td sched:create non_scheduled_query "" "select count(*) from www_access" -d sample_datasets
Delete the Schedule
To remove the schedule from a specific job, use:
td sched:delete <name>
For example:
$ td sched:delete hourly_count
Comments
0 comments
Please sign in to leave a comment.