If you are new to using SQL in PySpark, you may also want to try out the workbook in the same repository on Databricks Community Cloud.
There are a few things to bear in mind when using SQL with PySpark:
You need to create a temporary view of any dataframes you want to manipulate with SQL statements.
```python
# to create a table accessible to SQL called "table"
# from a PySpark dataframe called "df"
df.createOrReplaceTempView("table")
```
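Once the view exists, you can query it with `spark.sql`, which returns a new PySpark dataframe. A minimal sketch, assuming an active SparkSession named `spark` and using a placeholder column name:

```python
# run an ordinary SQL query against the temporary view;
# spark.sql returns a new PySpark dataframe
result = spark.sql("SELECT DISTINCT col_1 FROM table")
result.show()
```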
You also need to remember that PySpark dataframes are immutable. This means that in-place SQL modifications such as ALTER or UPDATE will not work; to change data you need to select it into a new dataframe and make the changes with a CASE expression.
```python
import pyspark.sql.functions as fn

# PySpark: replace values in col_4 with a conditional expression
df_pyspark = df.withColumn(
    "col_4",
    fn.when(df.col_4 == 'two', 'changed').otherwise(df.col_4))
df_pyspark.show()

# SQL
# IMPORTANT: ALTER and SET do not work as dataframes are immutable;
# here is a fallback using a CASE expression
df_sql = spark.sql("""
    SELECT col_1, col_2, col_3,
           CASE WHEN col_4 = 'two' THEN 'changed' ELSE col_4 END AS col_4
    FROM table""")
df_sql.show()

# alternatively you can combine spark.sql and PySpark commands;
# fn.col("col_4") refers to the new dataframe's column rather than df's
df_sql = spark.sql("SELECT * FROM table").withColumn(
    "col_4",
    fn.when(fn.col("col_4") == 'two', 'changed').otherwise(fn.col("col_4")))
df_sql.show()
```
Handling of nulls calls for caution: in SQL, comparing a column to NULL with `=` evaluates to NULL rather than True or False, so the PySpark .isNull and .isNotNull column methods (or SQL's IS NULL and IS NOT NULL) may be required in some cases. More information can be found here.
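As a brief illustration, here is a sketch of both approaches, reusing the placeholder dataframe `df`, view `table`, and column `col_4` from the examples above:

```python
# SQL: col_4 = NULL never evaluates to True, so rows would be
# silently dropped; use IS NULL / IS NOT NULL instead
nulls_sql = spark.sql("SELECT * FROM table WHERE col_4 IS NULL")
nulls_sql.show()

# PySpark: the equivalent column methods
df.filter(df.col_4.isNull()).show()
df.filter(df.col_4.isNotNull()).show()
```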