SQL to PySpark Conversion Cheatsheet

Following on from my pandas to pyspark cheatsheet here is another cheatsheet to help convert SQL queries into PySpark dataframe commands. Like the last one it can be downloaded from GitHub.

If you are new to using SQL in PySpark you may also want to try out the workbook in the same repository on DataBricks Community Cloud.

There are a few things to bear in mind when using SQL with PySpark:

You need to create a temporary view of any dataframes you want to manipulate with SQL statements.

# to create a table accessible to SQL called "table" 
# from a Pyspark dataframe called "df"


You also need to remember that PySpark dataframes are immutable, this means that SQL ALTER statements will not work, so to alter data you need to select it into a new dataframe and make the changes with a CASE statement.

# PySpark
df_pyspark = df.withColumn("col_4", \
  fn.when(df.col_4 == 'two', 'changed').\

# IMPORTANT: ALTER and SET do not work as dataframes are immutable
# here is a fallback
df_sql = spark.sql("SELECT col_1, col_2, col_3, \
  CASE WHEN col_4 = 'two' THEN 'changed' ELSE col_4 END AS \
  col_4 FROM table")

# alternatively you can combine spark.sql and pyspark commands
df_sql = spark.sql("SELECT * FROM table").\
  withColumn("col_4", fn.when(df.col_4 == 'two', 'changed').\

Handling of nulls can call for caution. The PySpark .isNull and .isNotNull functions may be required in some cases. More information can be found here.

Some more complicated functions may also require you to delve into further documentation on PySpark and SQL. Good starting points can be found here for Spark SQL and here for Pyspark dataframes.



Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.