With td-spark driver you can access data sets in Arm Treasure Data using Apache Spark.
|This feature is in BETA, and access is disabled by default. We're looking for customers who know Apache Spark well and are willing to try this feature and provide feedback. If you're interested, please contact support.|
|Recommendations regarding use: For fastest data access and lowest data transfer costs, we recommend that you set up your Spark cluster in the AWS us-east region. Data transfer costs may become quite high if you use other AWS regions or processing environments.|
- TD Spark FAQs
- Using TD Spark on Amazon Elastic MapReduce (EMR)
- Launching your own Spark cluster to use td-spark
The driver is available here: td-spark-assembly-latest.jar. It is recommended for use with Spark 2.1.0 or higher.
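For reference, here is a minimal sketch of reading a table with the driver from a Spark shell or application. It assumes the assembly jar is on the Spark classpath and that the td entry point is obtained from the SparkSession via the com.treasuredata.spark package (an assumption about the td-spark API; the database and table names are placeholders):

```scala
// Minimal sketch, assuming td-spark-assembly-latest.jar has been added to the
// Spark classpath (for example via --jars when starting spark-shell).
// The com.treasuredata.spark package and the spark.td entry point are assumptions
// about the td-spark API; verify them against the driver's documentation.
import org.apache.spark.sql.SparkSession
import com.treasuredata.spark._

val spark = SparkSession.builder()
  .appName("td-spark-example")
  .getOrCreate()

val td = spark.td  // entry point providing td.table(...) / td.df(...) as used below

// Read a TD table as a DataFrame; database and table names are placeholders.
val df = td.table("sample_datasets.www_access").df
df.show(10)
```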
- June 11, 2018
- Upgraded to Spark 2.3.0
- Support the expiration_sec option, e.g., td.table("...").withExpirationSec(7200)
- Support time window strings like td.table("...").within("-7d")
- Support td.table("...").fromUnixTime(...)/untilUnixTime(...) (see the sketch below)
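A hedged sketch of how these options can be combined, assuming td is the td-spark entry point as above; the 7-day window, the 7200-second expiration, and the unix-time values are illustrative only:

```scala
// Sketch combining the options listed above; assumes `td` is the td-spark entry point.
val recentWeek = td.table("mydb.events")   // placeholder database.table
  .within("-7d")                           // time window string: last 7 days
  .withExpirationSec(7200)                 // expiration_sec option (2 hours)
  .df

// Or select an explicit unix-time range:
val ranged = td.table("mydb.events")
  .fromUnixTime(1528675200L)               // illustrative start time (unix seconds)
  .untilUnixTime(1528761600L)              // illustrative end time (unix seconds)
  .df
```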
- Oct 31st, 2017
- Show td-spark version.
- Oct 5th, 2017
- Fixed a bug when reading tables containing null values
- July 14, 2017
- Support full time range format, e.g.,
td.df("(database).(table)", from = "2017-07-14 01:00:00 PDT", to="2017-07-15 02:00:00 PDT")
- Optimized the DataFrame reader by using the memory-efficient UnsafeRow class
- Improved the performance and memory consumption when reading a single table schema
- Fixed a bug that showed null values when reading columns that have alias names
- Jan 25, 2017
- Upgraded to Spark 2.1.0
- Renamed the Guava library package inside the assembly jar
- Upgraded to Airframe 0.9 and wvlet-log 1.1
- Nov 17, 2016
- Private alpha release
- Added basic functionality for using TD datasets inside Spark
- Reading mpc1 files
- Listing databases and table schemas
- Running Presto/Hive queries
- Support for SparkSQL
- PySpark support
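As a hedged illustration of the query-related functionality listed above: td.presto is an assumption about how td-spark exposes Presto queries, while createOrReplaceTempView and spark.sql are standard Spark SQL calls; database and table names are placeholders.

```scala
// Sketch of issuing a Presto query and querying a TD table through SparkSQL.
// Assumes `td` is the td-spark entry point and `spark` is an active SparkSession.
val prestoDf = td.presto("select method, count(*) cnt from sample_datasets.www_access group by 1")
prestoDf.show()

// Register a TD table as a temporary view so SparkSQL can query it.
td.table("sample_datasets.www_access").df.createOrReplaceTempView("www_access")
spark.sql("select code, count(*) from www_access group by code").show()
```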