A Short Snippet for Converting PySpark Schema

PySpark schemas can be laborious to write. One approach to this issue was discussed previously. However, DDL definitions may not meet all needs. In particular, where you have an example dataframe, it would be great to simply extract its schema and reuse it, modifying it as required. A dataframe's schema is easy enough to extract with df.schema. However, if you need to make even small alterations, the schema object on its own is not suitable, because what you want is its definition as editable code. We could print the schema and then rewrite it, but unfortunately the output from print(df.schema) is not formatted correctly for cutting and pasting directly into PySpark.

Therefore a small script to convert str(df.schema) into a usable PySpark schema definition can be useful. Here is a basic example of such a script. Note that it is not guaranteed to work in all cases, but it should generally get you much closer to usable PySpark code, which you can then quickly amend by hand to meet your needs.


# this is how to get the schema of an existing dataframe df as a string
schema = str(df.schema)

# define replacement dictionary
# the generated code assumes "from pyspark.sql import types as T"
replacements = {
    # add brackets to basic types
    'StringType,': 'T.StringType(),',
    'IntegerType,': 'T.IntegerType(),',
    'TimestampType,': 'T.TimestampType(),',
    'FloatType,': 'T.FloatType(),',
    'LongType,': 'T.LongType(),',
    'DoubleType,': 'T.DoubleType(),',
    'BooleanType,': 'T.BooleanType(),',
    # add prefixes to complex types
    'ArrayType': 'T.ArrayType',
    'MapType': 'T.MapType',
    'StructType': 'T.StructType',
    # capitalise booleans
    ',true)': ',True)',
    ',false)': ',False)',
    # create pythonic list brackets
    'List(': '[',
    ')))': ')\n])',
}

# perform replacements
for key, replacement in replacements.items():
    schema = schema.replace(key, replacement)

# now work through and enclose the column names in quotes
parts = schema.split('StructField(')
whole = []
for count, part in enumerate(parts):
    # the first item will not be a StructField
    if count != 0:
        fragments = part.split(',')
        defragment = []
        # for all other elements wrap the field name in speech marks
        for fcount, fragment in enumerate(fragments):
            if fcount == 0:
                fragment = "'" + fragment + "'"
            defragment.append(fragment)
        part = ", ".join(defragment)
    whole.append(part)
# we insert newlines and tabs at this point for a more readable schema
# StructField gets the same T. prefix as the other types so the output only needs the T alias
schema_out = '\n\tT.StructField('.join(whole)

# finally print our schema so we can use it how we please
print(schema_out)
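
To see what the conversion does without a Spark session, here is a minimal sketch that wraps the same steps in a function and runs them on a hand-written sample string. The function name schema_string_to_code, the column names name and age, and the trimmed replacement table are illustrative assumptions, and the sample assumes your Spark version prints schemas in the StructType(List(...)) style that the script above targets. It writes T.StructField( so that the result relies only on the T alias. If your version prints a different format, the replacement table would need adjusting.

```python
def schema_string_to_code(schema: str) -> str:
    """Convert a StructType(List(...)) style schema string into near-usable PySpark code."""
    # trimmed replacement table, enough for the sample below (illustrative only)
    replacements = {
        'StringType,': 'T.StringType(),',
        'IntegerType,': 'T.IntegerType(),',
        'StructType': 'T.StructType',
        ',true)': ',True)',
        ',false)': ',False)',
        'List(': '[',
        ')))': ')\n])',
    }
    for old, new in replacements.items():
        schema = schema.replace(old, new)

    # everything before the first StructField is the opening of the schema
    head, *fields = schema.split('StructField(')
    quoted = []
    for field in fields:
        # the column name is the text before the first comma, so wrap it in quotes
        name, _, rest = field.partition(',')
        quoted.append("'" + name + "'," + rest)

    # put each field on its own tab-indented line
    return '\n\tT.StructField('.join([head] + quoted)


# hand-written sample, standing in for str(df.schema)
sample = (
    "StructType(List(StructField(name,StringType,true),"
    "StructField(age,IntegerType,true)))"
)
print(schema_string_to_code(sample))
# T.StructType([
# 	T.StructField('name',T.StringType(),True),
# 	T.StructField('age',T.IntegerType(),True)
# ])
```

The printed text can be pasted after "from pyspark.sql import types as T" and edited by hand. As with the script above, column names containing commas and deeply nested structures are likely to need manual correction.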


Published by justinmatters
