Use Treasure Data Workflow to build repeatable data processing pipelines. You can schedule and manage complex tasks, run them automatically, and monitor your job flows. Workflow is also used in data segmentation.
Treasure Workflow extends and enhances the capabilities of the highly reputable open source workflow program, Digdag.
On March 27, 2018, we released our new Workflow User Interface for General Availability. The new user interface is now available for all accounts.
Read more about what is changing and what to expect during General Availability of Workflow UI.
Workflow is a key aspect of Treasure Data’s CDP. You create workflows to run efficient queries against your customer data and schedule tasks that feed into audience identification, profiling, and tracking.
Integrate with and organize your organization’s data, run SQL analysis across that data regardless of scale, and then create repeatable insights by saving queries that disseminate data.
Features and Benefits
Treasure Workflow allows you to do the following:
- Create a workflow, which defines the order in which processing tasks run
- Design with scheduled processing flows in mind
- Parameterize for easy cloning, sharing, and re-use
- Develop locally, then push to Treasure Data to run on a scheduled basis
- Manage error handling more easily
- Configure tasks that can operate nearly every part of the TD system, including:
  - Importing data in batch jobs using Data Connector
  - Running Presto and Hive queries
  - Creating or appending to tables
  - Exporting results to other systems
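As a rough illustration of the list above, a workflow definition might chain an import, a Presto query, and a Hive query so that each task starts only after the previous one succeeds. This is a minimal sketch, not a reference implementation: the file, database, and table names are hypothetical, and the referenced `.yml` and `.sql` files are assumed to exist in the project.

```yaml
# pipeline.dig (hypothetical file name)
timezone: UTC

_export:
  td:
    database: my_database        # hypothetical database name

# Tasks run top to bottom; each starts after the previous one succeeds.
+import:
  td_load>: config/import.yml    # hypothetical Data Connector configuration
  table: raw_events              # hypothetical destination table

+summarize:
  td>: queries/summarize.sql     # hypothetical query file
  engine: presto
  create_table: daily_summary    # replaces the table with the query result

+append_history:
  td>: queries/history.sql       # hypothetical query file
  engine: hive
  insert_into: summary_history   # appends the query result to the table
```

Splitting the work into named tasks like this is what makes the order of processing explicit and lets a failure be traced to a single step.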
With Treasure Data, you can improve your ability to create internal Data Applications and gain the following benefits:
Organizing your team’s work
As the number of scheduled queries or cron jobs increases, it becomes harder for an organization to keep track of what each one is doing. Organizing tasks into workflows and projects lets you immediately see the context in which a given query operates.
Managing error handling and establishing automated notifications
We often see very large queries and scripts operating in our customers’ systems. These can be hundreds, or even thousands, of lines long. When errors occur in these SQL queries, they can be incredibly difficult to debug. By breaking your large queries into workflows of smaller dependent queries, it becomes much easier to figure out which part of your logic has broken.
You can receive notifications when any part of your workflow fails, so that you can fix the problem quickly. You can also choose to be notified of successful workflow runs, or of workflow runs that do not complete within specified time boundaries.
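The three kinds of notification described above could be sketched with Digdag's `_error:` and `sla:` directives and the `mail>` operator, as shown here. Treat this as an assumption-laden example: the recipient address, database, and query file are hypothetical, and the 30-minute limit is an arbitrary value chosen for illustration.

```yaml
timezone: UTC

schedule:
  daily>: 07:00:00

# Runs if any task in this workflow fails.
_error:
  mail>: {data: "A task in the workflow failed."}
  subject: Workflow failed
  to: [me@example.com]           # hypothetical recipient

# Runs if the session has not finished 30 minutes after it started.
sla:
  duration: 00:30:00
  +notify_slow:
    mail>: {data: "The workflow is taking longer than expected."}
    subject: Workflow is running long
    to: [me@example.com]

+main_query:
  td>: queries/main.sql          # hypothetical query file
  database: my_database          # hypothetical database name

# Reached only when every earlier task has succeeded.
+notify_success:
  mail>: {data: "The workflow finished successfully."}
  subject: Workflow succeeded
  to: [me@example.com]
```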
Reducing end-to-end latency
By ensuring that dependencies between steps are properly kept, you can create processing pipelines for live data use cases, such as moving KPI updates from daily to hourly or even more frequent intervals.
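For instance, moving a KPI refresh from daily to hourly can come down to changing the `schedule:` block, while the task order keeps the dependency intact. The sketch below assumes hypothetical query files, database, and table names.

```yaml
timezone: UTC

# Start a new session at the top of every hour (MM:SS past the hour).
# A daily run would use something like "daily>: 07:00:00" instead.
schedule:
  hourly>: 00:00

_export:
  td:
    database: my_database        # hypothetical database name

+refresh_kpis:
  td>: queries/kpi.sql           # hypothetical query file
  create_table: kpi_latest

# Starts only after +refresh_kpis has succeeded.
+build_dashboard_table:
  td>: queries/dashboard.sql     # hypothetical query file
  create_table: kpi_dashboard
```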
Improving Collaboration and Re-Usability
Parameterization is deeply embedded within Treasure Workflow, so that, as an analyst, you can create a reusable workflow template. You can use your template workflow for future, additional analysis. Stop re-creating SQL statements for similar requests, and start templating your work for easier re-use.
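One way such a template might look is a workflow whose table names are variables defined under `_export:` and referenced with `${...}`. This is only a sketch: the variable names, tables, and query file are hypothetical, and the query file is assumed to reference the same variables.

```yaml
_export:
  td:
    database: my_database              # hypothetical database name
  source_table: web_events             # hypothetical variable and table
  output_table: web_events_summary     # hypothetical variable and table

+summarize:
  # The query file is a template, so it can contain ${source_table}
  # and built-in variables such as ${session_date}.
  td>: queries/summarize.sql           # hypothetical query file
  create_table: ${output_table}
```

To reuse the workflow for a similar request, you would copy it and change only the values under `_export:`, leaving the SQL untouched.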
Also, by organizing your queries, it becomes much easier to onboard new employees into your organization or into an ongoing project. You can use Treasure Workflow to group tasks together. New collaborators can more quickly understand the general “why” of a query before digging into its specific logic.
Command line interface and User interface
Work in your preferred environment. You can access Treasure Workflow from a command line interface or from within the Treasure Workflow UI.
|Command line interface|User interface|
|---|---|
|`$ mkdir wf_of_saved_queries` creates a local directory in which you can create your workflow. When you are ready to push your workflow to Treasure Data, you can also create a project folder. `$ cat > saved_queries.dig` creates a workflow definition file; in Treasure Data, a workflow is a .dig file.|Specify the Workflow Name first. Accept the default or specify a Project name, as a place to store all files associated with your workflow. Several workflows can be saved in a project.|
||Select a Workflow Template. A blank template is an empty workspace in which you can enter SQL or another type of script. Treasure Data also provides ‘starter’ templates with placeholder text.|
|Enter the content of the workflow file.|Enter SQL or another type of script in the edit box for the workflow definition file.|
|Optional: you can create a project folder for your workflows. Use the command: `td wf workflows`|Click Save & Commit. Specify the Session. A session specifies the date of the data. The workflow is run against the session.|
|`$ td wf run saved_queries`|Click Run.|
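As an example of what the content of `saved_queries.dig` might be, the following sketch runs two saved queries in sequence with the `td_run>` operator. The saved query names and the database are hypothetical placeholders; substitute the names of queries that already exist in your account.

```yaml
# saved_queries.dig
timezone: UTC

_export:
  td:
    database: my_database          # hypothetical database name

# Each task runs an existing saved query, referenced by name.
+first_saved_query:
  td_run>: my_saved_query_1        # hypothetical saved query name

# Starts only after the first saved query has succeeded.
+second_saved_query:
  td_run>: my_saved_query_2        # hypothetical saved query name
```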
Implement both locally and in the cloud
You have a variety of development approaches that you can take as you develop workflows.
Develop locally with TD CLI > Push into Treasure Data > Manage in Treasure Data GUI
It’s not unusual to create workflows in your local environment and run the same workflows in Treasure Data’s environment. You can store your data in Treasure Data’s cloud-based database and query the data either locally or from within the Treasure Data platform. You might want to create queries and workflows in the cloud but perform analysis locally, using in-house tools. Treasure Data makes it easy to move between the two environments.
Develop locally with TD CLI > Manage on GitHub > Autodeploy to Treasure Data GUI to view and monitor
Develop, view and manage in Treasure Data Console and Workflow UI
What you must know and do
Run through the QuickStart guide to set up and complete your first workflow. Review workflow syntax in order to understand how to configure tasks and build repeatable workflows.
You can quickly set up your workflow environment and create a workflow either from the command line interface or the workflow user interface (GUI).
You use a common set of syntax elements repeatedly when building and managing workflows. Refer to Syntax.
How to use Treasure Workflow
You can learn how to use Treasure Workflow using the documentation links below:
Building Workflows of TD Processing Steps
- Import data, then run a query & export, in a single workflow
- Run Presto & Hive jobs in a single workflow
- Build a workflow using saved queries
- How to schedule a workflow to run periodically
- Set notifications for successful execution, expected run-time, or failures
- Grouping & setting parallel execution of tasks
- Configuring workflow credentials
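Grouping and parallel execution, mentioned in the list above, might look like the following sketch. Child tasks nested under a parent form a group, and `_parallel: true` lets the children run at the same time. All task, table, and file names here are hypothetical.

```yaml
_export:
  td:
    database: my_database              # hypothetical database name

# The two child tasks are independent, so they run in parallel.
+prepare:
  _parallel: true

  +clean_orders:
    td>: queries/clean_orders.sql      # hypothetical query file
    create_table: orders_clean

  +clean_customers:
    td>: queries/clean_customers.sql   # hypothetical query file
    create_table: customers_clean

# Starts only after every task in +prepare has succeeded.
+join:
  td>: queries/join.sql                # hypothetical query file
  create_table: orders_with_customers
```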
Managing Workflows Submitted to Treasure Data
- Continuous Deployment of Workflow definitions from GitHub to Treasure Data
- Determine the status of workflows submitted to Treasure Data
- Reverting to a Previous Workflow Revision
- Investigate errors that occurred in workflows submitted to Treasure Data
Digdag vs Treasure Workflow
Treasure Workflow currently supports most of the functionality of Digdag, the underlying open source project, with a few exceptions. The following Digdag operators and functionality are not yet enabled when you submit workflows to the Treasure Workflow cloud environment.
First, you cannot currently run arbitrary code scripts. These include:
- `sh>` for running shell scripts
- `py>` for running Python scripts
- `rb>` for running Ruby scripts
Additionally, the following options are not allowed because shared processing and local disk are used:
- `embulk>` for running arbitrary Embulk jobs (but you can use `td_load>` for importing bulk data into Treasure Data)
- `download_file:` parameter with the `td>` and other operators for downloading files locally. Instead, you can use the normal Treasure Data result export functionality.
We are considering adding these functions to Treasure Workflow, our hosted version of Digdag. If you are interested in them, let us know!
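As a sketch of the result export alternative to `download_file:`, a `td>` task can send its query result to a saved connection instead of a local file. The connection name and the keys under `result_settings:` are hypothetical placeholders, because the available settings depend on the destination system.

```yaml
+export_results:
  td>: queries/export.sql                    # hypothetical query file
  database: my_database                      # hypothetical database name
  result_connection: my_export_connection    # hypothetical saved connection name
  result_settings:                           # placeholder keys; they vary by destination
    bucket: my-bucket
    path: /exports/${session_date}.csv
```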
Workflows in Audience Suite
Workflow is used in Audience Suite. You can create workflows for your source data in preparation for creating audiences. Workflow is also used in predictive scoring to refine audiences and segments, and when you send segmented data to other systems. Treasure Data generates workflows that you can view but are discouraged from editing.
Feedback and Feature Requests
We look forward to hearing your ideas on how to improve Treasure Workflow. You can submit your ideas by reaching out to us on the private beta Slack channel, or by submitting them on the Treasure Data Idea Forum.
If you have any questions, contact our support team.