This article explains how to import data from Google Sheets to Arm Treasure Data using embulk-input-google_spreadsheets input plugin.
Prerequisites
- Basic knowledge of Treasure Data.
- Basic knowledge of Embulk
- Embulk and embulk-output-td plugin installed on your machine.
Step 1: Install embulk-input-google_spreadsheets plugin
To install embulk-input-mysql plugin, run the following command:
$ embulk gem install embulk-input-google_spreadsheets
Step 2: Obtain required Google Api credentials.
In order for embulk to connect to the Google Sheets Api there a different authentication options available. This example will use the authorized_user
method which requires a json key file. You can get the key file from the Google Api console. The key file contains the credentials to allow connection to Google Sheets api, it has three fields required for the embulk plugin to run.
* client_id * client_secret * refresh_token
You will need to download the key file from an existing Google api project. If you have not created a project you can do so using this wizard provided by Google which will allow you to create a project and turn on the Google sheets API. Google Sheets Api Wizard.
Next, Go to Credentials –> Create Credentials –> OAuth ClientID –> Web Application. Enter a name then ‘Create’. The next screen will show you the client ID and client secret. Copy them as you will need them for the json key file. Completing these steps will provide you with the client_id
and client_secret
.
The remaining credential you need to get is the refresh token. One way to get the refresh token is to use the Google OAuth 2.0 Playground, available here: Google OAuth 2.0 Playground
First, in the Top right select the Gear Icon and make sure to check the Use your own OAuth credentials. In the
OAuth Client IDand the
OAuth Client secret`, insert the credentials you got from api console.
From the Api list on the left select the Google Sheets API v4
. Then proceed to the Step 2. - Exchange authorization code for tokens
. Press the Exchange authorization code for tokens
button which will fill the Refresh Token
and Access Token
fields. The refresh token field is the value you should note down for the json key file embulk will require. With all components of the json key file collected you are ready to begin the embulk configuration. The format of the json key file is:
{ "client_id": "xxxxxxxxxx", "client_secret": "xxxxxxxxxx", "refresh_token": "xxxxxxxxxx" }
Step 3: Create Embulk configuration file
Using your favorite text editor, create the embulk config file defining input(Google Sheets) and ouput (TD) parameters.
In the following example, for simplicity’s sake, the required json key file is used in-line). For further details about additional parameters and other examples are available, refer to Embulk Input Google Sheets.
Note the fields that are required in order for the plugin to run (for example spreadsheets_url
and worksheet_title)
. Fill in the relevant details for the sheet you are trying to import. The same applies for the output
section which outputs to Treasure Data.
The example assumes that you have already created a relevant table in Treasure Data that matches the details used in the config file, i.e. database
, would be the database created for the data, table
would be the within that database to receive the data.
in: type: google_spreadsheets auth_method: authorized_user json_keyfile: content: | { "client_id": "xxxxxxxxxx", "client_secret": "xxxxxxxxxx", "refresh_token": "xxxxxxxxxx" } spreadsheets_url: https://docs.google.com/spreadsheets/d/1xP_ZBA-Or4sJ63Zj99-JreuAFGn7Y2gcUKd1vtwSqwM/edit#gid=0 worksheet_title: title start_row: starting row default_timezone: timezone to be used null_string: '\N' default_typecast: strict columns: - {name: color, type: string} - {name: size, type: string} - {name: type, type: string} out: type: td apikey: xxxxxxxxxxxx endpoint: api.treasuredata.com database: dbname table: tblname time_column: datecolumn mode: replace #by default mode: append is used, if not defined. Imported records are appended to the target table with this mode. #mode: replace, replaces existing target table default_timestamp_format: '%d/%m/%Y'
Best Practice: Add the "auto_create_table: true" parameter to the load.yml, so that tables that do not exist are automatically. |
This is a sample of the auto_create_table parameter in a .yml file.
out: type: td apikey: <your apikey> endpoint: api.treasuredata.com database: dbname table: tblname time_column: created_at auto_create_table: true mode: append
Alternative: If you either: 1) must add a database or 2) do not add the auto_create_table parameter in a .yml file and must add a table, run the following TD commands: |
$ td database:create dbname $ td table:create dbname tblname
You can also create the database and table using Treasure Data Console
Step 4: Execute Load Job
Finally, issue the import job by running the following command:
$embulk run load.yml
The run time of the import job varies depending on the size of the data you are importing. After the job is done, the data should reflect in the TD database and table used in the embulk configuration file.
Comments
0 comments
Please sign in to leave a comment.