Two Handy PySpark Hints

A couple of quick hints that people may find useful.

Checking whether a Delta file exists

The Delta format is an excellent way to store PySpark output. It is built on Parquet and is partitionable, streamable, and can be operated on like an SQL database table. However, sometimes you may want to check that a Delta file exists before reading it in. This is important because failing to do so can result in errors in the underlying JVM which PySpark cannot handle via try-except logic.

Fortunately, the code needed to implement this check is very simple. However, the documentation detailing how to do this is a little obscure and hard to track down, so here it is:

# assuming we have already set up a SparkSession called spark,
# and that path is a string pointing at the location of the possible Delta table
from delta.tables import DeltaTable
 
if DeltaTable.isDeltaTable(sparkSession=spark, identifier=path):
  df = spark.read.format("delta").load(path)
else:
  df = None
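
If you find yourself doing this in more than one place, you could wrap the check and the read into a small helper. This is just a sketch of that idea; the function name read_delta_if_exists is my own and not part of the Delta API:

from typing import Optional

from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession


def read_delta_if_exists(spark: SparkSession, path: str) -> Optional[DataFrame]:
  # return the Delta table at path as a DataFrame, or None if no Delta table exists there
  if DeltaTable.isDeltaTable(sparkSession=spark, identifier=path):
    return spark.read.format("delta").load(path)
  return None


df = read_delta_if_exists(spark, path)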

Dealing with non-deterministic functions as PySpark UDFs

As you may or may not be aware, PySpark assumes by default that User Defined Functions (UDFs) are deterministic. This allows an optimisation where Spark can cache a result it has already computed for a given set of UDF inputs and reapply it to every row where those inputs occur. However, this means that if you want to add a random component to a UDF you need to explicitly inform PySpark of your intent, so that a separate non-deterministic value is calculated for every row. This is easy to do once you know that it needs to be done:

import uuid

from pyspark.sql import functions as fn
from pyspark.sql import types as T

# create a udf which will behave non-deterministically
random_uuid_udf = fn.udf(
  lambda: str(uuid.uuid4()), T.StringType()
).asNondeterministic()
 
# use it to create a column
df = df.withColumn('uuid', random_uuid_udf())
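
A quick way to convince yourself that the flag is doing its job is to check that every row received its own value; without asNondeterministic() you may find the same UUID repeated across rows. A small sketch, assuming df has more than one row:

# each row should get its own uuid, so the distinct count should match the row count
assert df.select('uuid').distinct().count() == df.count()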
