Easier Way to Define Schema for PySpark

If you have ever had to define a schema for a PySpark dataframe, you will know it is something of a rigmarole. Sometimes we can dodge this by letting Spark infer the schema, for example when reading JSON. In other cases, such as streaming dataframes, that is not possible and the schema has to be supplied explicitly.
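For instance (a minimal sketch; the paths and file-based source are illustrative assumptions), a static JSON read can infer its schema by sampling the data, while a file-based streaming read needs one supplied up front:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Static read: Spark can infer the schema by sampling the JSON files
static_df = spark.read.json("/data/events/")

# Streaming read: the schema must be supplied explicitly
# stream_df = (spark.readStream
#              .schema(my_schema)   # a schema object like the ones built below
#              .json("/data/events/"))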

[Image: gramophone line art, captioned "hardware for an early streaming application"]

The good news is that, as well as carefully built schema objects, you can also convert DDL-formatted strings into a schema. This can often be simpler and quicker since the format is less verbose.

First, import types as you would for the usual method of creating a schema:

import pyspark.sql.types as T 

The usual way to create a schema at this point is:

schema = T.StructType([
    T.StructField("col1", T.StringType(), True),
    T.StructField("col2", T.IntegerType(), True),
    T.StructField("col3", T.ArrayType(T.TimestampType()), True)
])

However, if we use a DDL string, this is a lot easier to write:

ddl_schema_string = "col1 string, col2 integer, col3 array<timestamp>"
ddl_schema = T._parse_datatype_string(ddl_schema_string)
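As a quick check (note that the parser needs an active SparkSession/SparkContext behind the scenes, since it delegates to the JVM), the parsed schema should be equivalent to the hand-built one and can be passed straight to a reader; the streaming path below is illustrative:

# Should print True for this example, since both schemas describe the same fields
print(ddl_schema == schema)
print(ddl_schema.simpleString())

# e.g. use it for a file-based streaming read
# stream_df = spark.readStream.schema(ddl_schema).json("/data/events/")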

There are some caveats. Firstly, it seems that you can only create nullable columns this way. Secondly, this uses a helper function from pyspark.sql.types that is intended for internal use, so there is a risk the functionality might change at some point. For now, though, it is a nice PySpark substitute for the Scala StructType.fromDDL() function.
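If you do need non-nullable fields, one workaround (a sketch of my own, not something from the DDL parser itself) is to rebuild the StructFields after parsing and flip the nullable flag yourself:

# Hypothetical example: make col2 non-nullable after parsing the DDL string
non_nullable = {"col2"}
adjusted_schema = T.StructType([
    T.StructField(f.name, f.dataType, nullable=(f.name not in non_nullable))
    for f in ddl_schema.fields
])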

The function we are using can be seen at https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/types.html and the Scala code that can read DDLs, which put me on the trail of something similar for PySpark, is described here: https://sparkbyexamples.com/spark/spark-schema-explained-with-examples/

I have created a tiny notebook to demonstrate usage, which should be available for the next few months, but it really is as simple as the snippet above: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/968100988546031/1656753038112086/8836542754149149/latest.html

Finally, here is some more information about Structured Streaming on YouTube:

