This article explains how to import files from your AWS S3 bucket into Arm Treasure Data using the embulk-input-s3 input plugin.
- Basic knowledge of Treasure Data.
- Basic knowledge of Embulk.
- Embulk and the embulk-output-td plugin installed on your machine.
Step 0: Install embulk-input-s3 plugin
To install the embulk-input-s3 plugin, run the following command:
$ embulk gem install embulk-input-s3
Step 1: Create seed configuration file
Using your favorite text editor, create an Embulk configuration file (for example, seed.yml) that defines the input (S3) and output (TD) parameters. Example:
in:
  type: s3
  bucket: s3bucket
  path_prefix: path/to/sample_file # path of *.csv or *.tsv file on your s3 bucket
  access_key_id: xxxxxxxxxx
  secret_access_key: xxxxxxxxxxx
out:
  type: td
  apikey: xxxxxxxxxxxx
  endpoint: api.treasuredata.com
  database: dbname
  table: tblname
  time_column: datecolumn
  mode: replace
  # By default, mode: append is used if not defined. Imported records are appended to the target table with this mode.
  # mode: replace replaces the existing target table.
  default_timestamp_format: '%d/%m/%Y'
For further details about the additional parameters available for embulk-input-s3, refer to Embulk Input S3.
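If you would rather not type credentials into the file by hand, the same seed.yml can be generated from environment variables. The following is a minimal sketch, not part of the plugin itself: the variable names AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and TD_API_KEY are assumptions chosen for illustration, and the bucket, path, database, and table values are the placeholders from the example above.

```shell
#!/usr/bin/env bash
# Sketch: write seed.yml from environment variables (variable names are assumptions).
set -euo pipefail

# Placeholder defaults mirror the example above; export real values before running.
: "${AWS_ACCESS_KEY_ID:=xxxxxxxxxx}"
: "${AWS_SECRET_ACCESS_KEY:=xxxxxxxxxxx}"
: "${TD_API_KEY:=xxxxxxxxxxxx}"

# Unquoted heredoc delimiter, so the shell expands ${...} while writing the file.
cat > seed.yml <<EOF
in:
  type: s3
  bucket: s3bucket
  path_prefix: path/to/sample_file
  access_key_id: ${AWS_ACCESS_KEY_ID}
  secret_access_key: ${AWS_SECRET_ACCESS_KEY}
out:
  type: td
  apikey: ${TD_API_KEY}
  endpoint: api.treasuredata.com
  database: dbname
  table: tblname
  time_column: datecolumn
  mode: replace
  default_timestamp_format: '%d/%m/%Y'
EOF

# seed.yml now contains secrets; restrict it to the current user.
chmod 600 seed.yml
```

The generated file still contains the secrets in plain text, so treat it like any other credential file and keep it out of version control.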
Step 2: Guess Fields (Generate load.yml)
The Embulk guess option uses seed.yml to read the target file, automatically guesses the column types and settings, and creates a new file, load.yml, with this information.
$ embulk guess seed.yml -o load.yml
You can now preview the data using the embulk preview load.yml command. If any of the column types or data look incorrect, edit load.yml directly and preview again to verify. If the guess option doesn't yield satisfactory results, change the parameters in load.yml manually, according to your requirements, using the CSV/TSV parser plugin options.
You will need to create the database and table in TD prior to executing the load job.
To do this:
$ td database:create dbname
$ td table:create dbname tblname
Alternatively, you may create the database and table via the Treasure Data Console.
Step 3: Execute Load Job
Finally, issue the import job by running the following command:
$ embulk run load.yml
The job may take anywhere from a few minutes to several hours to complete, depending on the size of the data.
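The Embulk commands from Steps 2 and 3 can also be chained in a small wrapper script so that a failed guess or preview stops the import before anything is loaded. This is only a sketch under stated assumptions: the function names run_step and import_s3_to_td and the DRY_RUN switch are hypothetical helpers, not part of Embulk, and the script assumes the database and table already exist in TD.

```shell
#!/usr/bin/env bash
# Sketch: chain guess, preview, and run; stop at the first failing step.
set -euo pipefail

# DRY_RUN=1 (the default in this sketch) only prints the commands.
# Set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: the same three embulk commands used in this article.
import_s3_to_td() {
  local seed="${1:-seed.yml}"
  local load="${2:-load.yml}"
  run_step embulk guess "$seed" -o "$load"
  run_step embulk preview "$load"
  run_step embulk run "$load"
}

import_s3_to_td seed.yml load.yml
```

Run it once with the default DRY_RUN=1 to confirm the commands look right, then rerun with DRY_RUN=0 to perform the import. Because the preview step is not interactive here, review its output before relying on the wrapper for unattended loads.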