With the td-spark driver, you can access data sets in Treasure Data using Apache Spark.
|This feature is in beta, and access is disabled by default. We're looking for customers who know Apache Spark well and are willing to try this feature and give feedback to our team. If you're interested, please contact support.|
|Recommendations regarding use: For the fastest data access and the lowest data transfer costs, we recommend that you set up your Spark cluster in the AWS us-east region. Data transfer costs may become quite high if you use other AWS regions or processing environments. While we anticipate making this feature available to all our customers by default, it may come at an additional cost once fully released if we see data access patterns that create higher expenses than anticipated.|
- TD Spark FAQs
- General Usage of td-spark
- Using TD Spark on Amazon Elastic MapReduce (EMR)
- For launching your own Spark cluster to use td-spark.
The driver is available from here: td-spark-assembly-latest.jar. This driver is recommended for use with Spark 2.1.0 or higher.
- Oct 31st, 2017
- Show td-spark version.
- Oct 5th, 2017
- Fixed a bug when reading tables containing null values.
- July 14, 2017
- Support full time range format, e.g.,
td.df("(database).(table)", from = "2017-07-14 01:00:00 PDT", to = "2017-07-15 02:00:00 PDT")
- Optimized DataFrame reader by using memory-efficient UnsafeRow class
- Improved the performance and memory consumption when reading a single table schema
- Fixed a bug that showed null values when reading columns that have alias names
- Jan 25, 2017
- Upgraded to Spark 2.1.0
- Renamed the Guava library package inside the assembly jar
- Upgraded to Airframe 0.9 and wvlet-log 1.1
- Nov 17, 2016
- Private alpha release
- Added basic functionality for using TD datasets inside Spark
- Reading mpc1 files
- Listing databases and table schemas
- Running Presto/Hive queries
- Support for Spark SQL
- PySpark support