Use Treasure Data Workflow to build repeatable data processing pipelines. You can schedule and manage complex tasks, run them automatically, and monitor your job flows. Workflow is also used in data segmentation.
Treasure Workflow extends and enhances the capabilities of Digdag, a well-regarded open source workflow engine.
Workflow is a key aspect of Treasure Data’s CDP. You create workflows to run efficient queries against your customer data and schedule tasks that feed into audience identification, profiling, and tracking.
Integrate with and organize your organization’s data, run SQL analysis across that data regardless of scale, and then create repeatable insight by saving queries that disseminate data.
Features and Benefits
Treasure Workflow allows you to do the following:
- Create a workflow, which defines the order in which processing tasks will run
- Design with scheduled processing flows in mind
- Parameterize for easy cloning, sharing, and re-use
- Develop locally, then push to Treasure Data to run on a schedule
- Manage error handling more easily
- Configure tasks that can operate nearly every part of the TD system, including:
  - Importing data in batch jobs using Integrations
  - Running Presto and Hive queries
  - Creating or appending to tables
  - Exporting results to other systems
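As a rough illustration of the task types listed above, the following is a minimal sketch of a workflow definition file. The file name, database, tables, query files, load configuration, and connection name are all hypothetical placeholders; only the operator names (`td_load>`, `td>`) and their options follow Digdag conventions.

```yaml
# pipeline.dig (hypothetical file name; every name below is a placeholder)
timezone: UTC

_export:
  td:
    database: example_db          # assumed database name

# Import batch data using an Integration described in a load configuration file
+import:
  td_load>: config/import.yml     # hypothetical configuration file
  database: example_db
  table: raw_events

# Run a Presto query and write the result to a table
+aggregate:
  td>: queries/daily_summary.sql  # hypothetical query file
  engine: presto
  create_table: daily_summary

# Run a Hive query and append the result to an existing table
+append:
  td>: queries/new_users.sql      # hypothetical query file
  engine: hive
  insert_into: users_history

# Export query results to another system through a named connection
+export:
  td>: queries/export.sql         # hypothetical query file
  result_connection: my_connection  # hypothetical connection name
```

Tasks run from top to bottom, so each step starts only after the previous one has succeeded.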
With Treasure Data, you can improve your ability to create internal Data Applications and gain the following benefits:
Organizing your team’s work
As the number of scheduled queries or cron jobs increases, it becomes harder for an organization to keep track of what each one is doing. Defining tasks within organized workflows and projects lets you immediately see the context in which a given query operates.
Managing error handling and establishing automated notifications
We often see very large queries and scripts running in our customers’ systems. These can be hundreds, or even thousands, of lines long. When errors occur in these SQL queries, they can be incredibly difficult to debug. By breaking your large queries into workflows of smaller dependent queries, it becomes much easier to figure out which part of your logic has broken.
You can receive notifications when any part of your workflow fails, so that you can fix the problem quickly. You can also choose to be notified of successful workflow runs, or of workflow runs that do not complete within specified time boundaries.
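The notification options described above might be configured along these lines. This is a sketch under assumptions: the email addresses, mail body files, and query files are hypothetical, and the one-hour limit is an arbitrary example value. The `_error`, `sla`, and `mail>` constructs come from Digdag.

```yaml
# notify.dig (hypothetical file name; addresses and file paths are placeholders)
timezone: UTC

schedule:
  daily>: 07:00:00

# Notify if a run has not finished within one hour of starting
sla:
  duration: 01:00:00
  +notice:
    mail>: mail/too_long.txt      # hypothetical mail body file
    subject: Workflow is taking longer than expected
    to: [ops@example.com]

# Notify when any task in the workflow fails
_error:
  mail>: mail/failure.txt         # hypothetical mail body file
  subject: Workflow failed
  to: [ops@example.com]

# Small dependent queries instead of one very large query
+step1:
  td>: queries/step1.sql
  database: example_db

+step2:
  td>: queries/step2.sql
  database: example_db

# Notify on success; this task runs only if every step above succeeded
+notify_success:
  mail>: mail/success.txt         # hypothetical mail body file
  subject: Workflow finished successfully
  to: [ops@example.com]
```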
Reducing end-to-end latency
By ensuring that dependencies between steps are properly kept, you can create processing pipelines for live data use cases, such as moving KPI updates from daily to hourly, or even more frequently.
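For instance, an hourly KPI refresh could be sketched as follows. The table, query files, and connection name are hypothetical; the scheduling directives are Digdag's.

```yaml
# kpi_hourly.dig (hypothetical file name; all names are placeholders)
timezone: UTC

# Run at minute 5 of every hour.
# For shorter intervals, Digdag also offers minutes_interval> and cron>.
schedule:
  hourly>: 05:00

_export:
  td:
    database: example_db

# Step 1: recompute the KPI table
+refresh_kpis:
  td>: queries/kpi.sql
  create_table: kpi_latest

# Step 2: starts only after step 1 succeeds, so it never publishes stale data
+publish:
  td>: queries/kpi_export.sql
  result_connection: my_dashboard_connection  # hypothetical connection name
```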
Improving collaboration and reusability
Parameterization is deeply embedded within Treasure Workflow, so that, as an analyst, you can create a reusable workflow template and apply it to future analyses. Stop re-creating SQL statements for similar requests, and start templating your work for easier re-use.
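A template of this kind might look like the sketch below. The parameter names (`target_table`, `lookback_days`) and the query file are hypothetical and chosen only for illustration; `_export` and `${...}` substitution are standard Digdag features.

```yaml
# template.dig (hypothetical file name; parameter names are illustrative)
timezone: UTC

_export:
  td:
    database: example_db    # assumed database name
  target_table: purchases   # change these values to reuse the workflow
  lookback_days: 7

# The query file can reference ${target_table}, ${lookback_days}, and
# built-in variables such as ${session_date}; they are substituted before
# the query runs.
+summarize:
  td>: queries/summary.sql  # hypothetical query file
  create_table: ${target_table}_summary
```

To reuse the template for a similar request, copy the file and change only the values under `_export`; the SQL stays the same.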
Also, by organizing your queries, it becomes much easier to onboard new employees into your organization or into an ongoing project. You can use Treasure Workflow to group tasks together. New collaborators can more quickly understand the general “why” of a query before digging into its specific logic.
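Grouping can be expressed by nesting tasks under a parent task, as in this sketch. The group names, tables, and query files are hypothetical.

```yaml
# grouped.dig (hypothetical file name; all names are placeholders)
timezone: UTC

_export:
  td:
    database: example_db

# Group 1: prepare inputs. The child tasks are independent of each other,
# so _parallel lets them run at the same time.
+prepare:
  _parallel: true

  +clean_orders:
    td>: queries/clean_orders.sql
    create_table: orders_clean

  +clean_customers:
    td>: queries/clean_customers.sql
    create_table: customers_clean

# Group 2: starts only after every task in +prepare has succeeded.
+report:
  +join:
    td>: queries/join.sql
    create_table: customer_orders
```

The group names give a new collaborator the purpose of each block before they open any individual query.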
Review workflow syntax to understand how to configure tasks and build repeatable workflows. You use a set of code pieces repeatedly when building and managing workflows. Refer to Syntax.
Set up your workflow environment and create a workflow from the command line interface or the workflow user interface (GUI).
CLI or UI
Work in your preferred environment. You can access Treasure Workflow from a command line interface or from within the Treasure Data Console Workflow UI:
| CLI | UI |
|---|---|
| `$ mkdir wf_of_saved_queries` creates a local directory in which you can create your workflow. When you are ready to push your workflow to Treasure Data, you can also create a project folder. | Specify the Workflow Name first. Accept the default or specify a Project name, as a place to store all files associated with your workflow. Several workflows can be saved in a project. |
| `$ cat > saved_queries.dig` creates a workflow definition file; in Treasure Data, the workflow is a .dig file. | Select a Workflow Template. A blank template is an empty workspace in which you can enter SQL or another type of script. Treasure Data also provides ‘starter’ templates with placeholder text. |
| Enter the content of the workflow file. | Enter SQL or another type of script in the edit box for the workflow definition file. |
| Optional: you can create a project folder for your workflows. Use the command: `td wf workflows ` | Click Save & Commit. Specify the Session. A session specifies the date of the data. The workflow is run against the session. |
| `$ td wf run saved_queries` | Click Run. |
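The content of `saved_queries.dig` could look like the following sketch, which runs two saved queries in sequence. The saved query names and database are hypothetical; `td_run>` is the Digdag operator for running a saved query by name or ID.

```yaml
# saved_queries.dig (the saved query names below are placeholders)
timezone: UTC

_export:
  td:
    database: example_db          # assumed database name

# Run an existing saved query
+first_query:
  td_run>: my_saved_query_1       # hypothetical saved query name

# Runs only after the first saved query has succeeded
+second_query:
  td_run>: my_saved_query_2       # hypothetical saved query name
```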
Implement both locally and in the cloud
You can take several approaches as you develop workflows.
Develop locally with TD CLI > Push into Treasure Data > Manage in Treasure Data GUI
It’s not unusual to create workflows in your local environment and run the same workflows in Treasure Data’s environment. You can store your data in Treasure Data’s cloud-based database and query the data either locally or from within the Treasure Data platform. You might want to create queries and workflows in the cloud but perform analysis locally, using in-house tools. Treasure Data makes it easy to move between the two interfaces.
Develop locally with TD CLI > Manage on GitHub > Autodeploy to Treasure Data GUI to view and monitor
Develop, view and manage in Treasure Data Console and Workflow UI
Treasure Workflow Sample Reference
You can learn how to use Treasure Workflow using the documentation links below:
Building Workflows of TD Processing Steps
- Workflow Introductory Tutorial: import data, then run a query and export, in a single workflow
- Run Presto and Hive jobs in a single workflow
- Create workflows using saved queries
- How to schedule a workflow to run periodically
- Set notifications for successful execution, expected run-time, or failures
- Grouping & setting parallel execution of tasks
- Configuring workflow credentials (secrets)
Managing Workflows Submitted to Treasure Data
- Continuous deployment of workflow definitions from GitHub to Treasure Data
- TD Console: Determine the status of workflows submitted to Treasure Data
- Reverting Back to a Previous Workflow Revision
- Troubleshoot errors that occurred in workflows submitted to Treasure Data
Digdag vs Treasure Workflow
Treasure Workflow currently supports most of the functionality of Digdag, the underlying open source project, with a few exceptions. The following Digdag operators and functionality are not yet enabled when you submit workflows to the Treasure Workflow cloud environment:
You cannot currently run arbitrary scripts, including:
- `sh>` for running shell scripts
- `py>` for running Python scripts
- `rb>` for running Ruby scripts
The following options are not allowed because shared processing and local disk are used:
- `embulk>` for running arbitrary Embulk jobs (but you can use `td_load>` for importing bulk data into Treasure Data)
- The `download_file:` parameter with the `td>` and other operators for downloading files locally. Instead, you can use the normal Treasure Data result export functionality.
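The two substitutions described above might look like this in a workflow definition. It is only a sketch: the configuration file, table, query file, and connection name are hypothetical.

```yaml
# Instead of embulk>: load bulk data with td_load>
+load:
  td_load>: config/bulk_load.yml    # hypothetical load configuration file
  database: example_db
  table: imported_rows

# Instead of download_file: send the query result to a result export target
+query_and_export:
  td>: queries/report.sql           # hypothetical query file
  database: example_db
  result_connection: my_connection  # hypothetical connection name
```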
Workflows and Profile Sets in Audience Studio
Workflow is used in Audience Studio. You can create workflows for your source data in preparation for creating master segments. Workflow is also used in predictive scoring to refine audiences and segments, and when you send segmented data to other systems. Treasure Data generates these workflows; you can view them, but editing them is discouraged.