If you have ever had to define a schema for a PySpark dataframe, you will know it is something of a rigmarole. Sometimes we can dodge this by letting Spark infer the schema, for example when reading in JSON. In other cases, such as streaming dataframes, inference is not possible and a schema must be supplied.
The good news is that, as well as carefully built schema objects, you can also convert DDL-formatted strings to a schema. This is often simpler and quicker, since the format is much less verbose.
First import types as for the usual method of creating a schema:
import pyspark.sql.types as T
Now usually the way to create a schema at this point is:
schema = T.StructType([
    T.StructField("col1", T.StringType(), True),
    T.StructField("col2", T.IntegerType(), True),
    T.StructField("col3", T.ArrayType(T.TimestampType()), True),
])
However if we use a DDL schema, this is a lot easier to write:
ddl_schema_string = "col1 string, col2 integer, col3 array<timestamp>"
ddl_schema = T._parse_datatype_string(ddl_schema_string)
There are some caveats. Firstly, it seems that you can only create nullable columns this way. Secondly, this relies on a helper function from pyspark.sql.types, so there is a risk that the functionality might change at some point, as it is intended for internal use. For now, though, it is a nice PySpark substitute for Scala's StructType.fromDDL() function.
The function we are using can be seen at https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/types.html, and the Scala code for reading DDLs, which put me on the trail of something similar for PySpark, is covered here: https://sparkbyexamples.com/spark/spark-schema-explained-with-examples/
I have created a tiny workbook to demonstrate usage, which should be available for the next few months, but it really is as simple as the snippet above: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/968100988546031/1656753038112086/8836542754149149/latest.html
Finally, there is some more information about structured streaming on YouTube.