You can use td-pyspark to bridge the results of data manipulations in Google Colab with your data in Arm Treasure Data.
Google Colab notebooks make it easy to model with PySpark in Google. PySpark is a Python API for Spark. Treasure Data's td-pyspark is a Python library that provides a handy way to use PySpark and Treasure Data based on td-spark.
To follow the steps in this example, you must have the following Treasure Data items:
- Treasure Data API key
- td-spark feature enabled
Configuring your Google Colab Environment
You create an envelope, install pyspark and td-pyspark libraries and configure the notebook for your connection code.
Create an Envelope in Google Colab
Open Google Colab. Click File > New Python 3 notebook.
Ensure that the runtime is connected. The notebook shows a green check on the top right corner.
Prepare your Environment for the PySpark and TD-PySpark Libraries
Click the icon to add a code cell:
Enter the following code:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install pyspark td-pyspark
Create and Upload the td-spark.conf File
You specify your TD API key and site on your local file system. Create a file as follows:
An example of the format is as follows. You provide the actual values:
spark.td.apikey (Your TD API KEY)
spark.td.site (Your site: us, jp, eu01)
Name the file
td-spark.conf and upload the file by clicking Files > Upload on the Google Colab menu. Verify that the td-spark.conf file is saved in the /content directory.
Run the Installation and Begin Work in Databricks
Run the current cell by selecting the cell and clicking
shift + enter keys.
Create a second code cell and create a script similar to the following code:
os-environ[‘PYSPARK_SUBMIT_ARGS’] = ‘--jars /usr/local/lib/python2.7/dist-packages/td_pyspark/jars/td-spark-assembly.jar --properties-file /content/td-spark.conf pyspark-shell’
os-environ[“JAVA_HOME”] = “/usr/lib/jvm/java-8-openjdk-amd64"
from pyspark import SparkContext
from pyspark.sql import SparkSession
builder = SparkSession.builder.appName("td-pyspark-test")
td = td_pyspark.TDSparkContextBuilder(builder).build()
df = td.table("sample_datasets.www_access").within("-10y").df()
TDSparkContextBuilder is an entry point to access td_pyspark's functionalities. As shown in the preceding code sample, you read tables in Treasure Data as data frames:
df = td.table("tablename").df()
You see a result similar to the following:
Your connection is working.
Interacting with Treasure Data from Google Colab
In Google Colab, you can run select and insert queries to Treasure Data or query back data from Treasure Data. You can also create and delete databases and tables.
In Google Colab, you can use the following commands:
Read Tables as DataFrames
.df() your table data is read as Spark's DataFrame. The usage of the DataFrame is the same with PySpark. See also PySpark DataFrame documentation.
df = td.table("sample_datasets.www_access").df()
Submit Presto Queries
If your Spark cluster is small, reading all of the data as in-memory DataFrame might be difficult. In this case, you can use Presto, a distributed SQL query engine, to reduce the amount of data processing with PySpark.
q = td.presto("select code, * from sample_datasets.www_access") q.show()
q = td.presto("select code, count(*) from sample_datasets.www_access group by 1")
Create or Drop a Database
Upload DataFrames to Treasure Data
To save your local DataFrames as a table, you have two options:
- Insert the records in the input DataFrame to the target table
- Create or replace the target table with the content of the input DataFrame
Checking Google Colab in Treasure Data
You can use td toolbelt to check your database from a command line. Alternatively, if you have TD Console, you can check your databases and queries. Read about Database and Table Management.