Data coming in from external sources can be rather messy. In this article we will look at a couple of handy hints for dealing with two common problems in incoming data: missing columns and null values.

Missing Columns
Missing columns can arise for a number of reasons. Perhaps a new CSV or other file is missing a column by mistake. Another possibility is a dynamic data source stored as a series of JSON files, where a parameter key may only be present if at least one record in the data uses that key. There are a variety of ways to work around this. One option is to specify an explicit structure (schema) for the incoming data when converting it to a DataFrame.
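For example, here is a minimal sketch of supplying a schema when reading JSON; the field names and path are purely illustrative:
from pyspark.sql import types as T

# define the expected structure up front, including a nested struct field
schema = T.StructType([
  T.StructField("id", T.LongType(), True),
  T.StructField("payload", T.StructType([
    T.StructField("optional_key", T.StringType(), True),
  ]), True),
])

# fields absent from the raw files will still appear as columns, populated with nulls
df = spark.read.schema(schema).json("/path/to/json/files")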
However, what if a given column is not always present in a DataFrame and we want to check for its presence? At a simple level this is easily done:
columns = df.columns
target_column = "column_name"
if target_column in columns:
  print("present")  # proceed as normal
else:
  print("absent")  # absence handling code here
However, what if things are more complicated? We might need to check for the presence of a column inside a struct. We can use a function like this, which works with struct columns too:
from pyspark.sql.utils import AnalysisException

def column_found(df, column):
  # attempt to resolve the column; nested fields like "struct_col.field" also resolve here
  try:
    df[column]
    return True
  except AnalysisException:
    return False
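For example, assuming df has a struct column called payload (the names here are illustrative):
column_found(df, "column_name")           # True if the top-level column exists
column_found(df, "payload.optional_key")  # also works for a field nested inside a struct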
Handling Nulls
Another problem is how to handle nulls in columns. Now clearly we can fill nulls using:
df.fillna(value, subset=[columns])
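For instance, a quick sketch, assuming df has a numeric amount column and a string category column (names chosen purely for illustration):
# fill numeric nulls in one column and string nulls in another
df_filled = df.fillna(0, subset=["amount"]).fillna("unknown", subset=["category"])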
However, sometimes nulls are telling us something important and we may not want to fill or drop them. This might apply if we later need to perform a SQL join or other operation which may need to exclude the nulls.
Fortunately we have a couple of methods for handling nulls without filling them. Within PySpark clauses such as filter and when, we can use .isNull() and .isNotNull() in conjunction with other conditions.
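For example, a sketch assuming a DataFrame df with columns operand_1 and operand_2 (illustrative names):
from pyspark.sql import functions as fn

# keep only the rows where operand_1 is populated
populated = df.filter(fn.col("operand_1").isNotNull())

# flag the rows where operand_2 is missing
flagged = df.withColumn(
  "operand_2_missing",
  fn.when(fn.col("operand_2").isNull(), True).otherwise(False),
)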
There is also an additional, little-advertised source of assistance: PySpark has a null safe equality operator, <=>. It has the following pattern of responses, which can be useful as part of joins or other operations under certain circumstances:
| operand_1 | operand_2 | standard_equality | null_equality |
|-----------|-----------|-------------------|---------------|
| 1         | 1         | true              | true          |
| 1         | 2         | false             | false         |
| 1         | null      | null              | false         |
| null      | 1         | null              | false         |
| null      | null      | null              | true          |
However, this operator is only available as part of SQL expressions in PySpark. This means that you will need to invoke a PySpark SQL context in order to make use of it.
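To illustrate, assume a small DataFrame df holding the operand pairs from the table above (this construction is just a sketch):
# build a toy DataFrame containing the operand pairs, including nulls
df = spark.createDataFrame(
  [(1, 1), (1, 2), (1, None), (None, 1), (None, None)],
  "operand_1 int, operand_2 int",
)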
# code snippet showing use of the <=> operator
df.createOrReplaceTempView("null_table")
sql_null_results = sqlContext.sql(
  """
  SELECT operand_1,
    operand_2,
    operand_1 = operand_2 AS standard_equality,
    operand_1 <=> operand_2 AS null_safe_equality
  FROM null_table
  """
)
display(sql_null_results)
It would be useful on occasion to have a similar null safe equality available to act directly on Spark DataFrames without needing to invoke Spark SQL. Fortunately we can do this quite easily, since Python's equality operator == is already null safe (None == None evaluates to True). Let's create a null safe User Defined Function (UDF):
# we can create a null safe equality UDF
from pyspark.sql import functions as fn
from pyspark.sql import types as T

def null_safe_equality(a, b):
  # Python's == treats None == None as True, so this comparison is null safe
  return a == b

udf_null_safe_equality = fn.udf(
  null_safe_equality,
  T.BooleanType(),
)
pyspark_null_results = df.withColumn(
  "standard_equality",
  fn.col("operand_1") == fn.col("operand_2")
).withColumn(
  "null_equality",
  udf_null_safe_equality("operand_1", "operand_2")
)
Note that the == within the UDF is null safe in a way that the == operator used directly within PySpark column expressions is not. For a demonstration of this (and of the failure of using <=> directly) see the example workbook I have created at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/968100988546031/764888478920499/8836542754149149/latest.html
A fuller discussion of null handling in PySpark can be found at https://spark.apache.org/docs/3.0.0-preview/sql-ref-null-semantics.html