Coalesce for Combining Columns in PySpark

We frequently find ourselves wanting to combine the results of several calculations into a single column. For instance, we might have several data sources with incomplete data and want to take the best non-null value we can find across them. Or we might want to use some form of logic to decide which of a number of columns we can trust, and select that value.

In these cases, the coalesce function is extremely useful. It becomes even more powerful when combined with conditional logic using the PySpark when function and the otherwise column method.

Basic Coalesce

Let's start with a simple example. Suppose we have a dataframe like this:
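
To follow along, here is a minimal sketch of such a dataframe (the row values are illustrative assumptions, not the original data), along with the imports used throughout:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit, when

spark = SparkSession.builder.getOrCreate()

# three nullable integer columns with a mix of null and non-null values
df = spark.createDataFrame(
    [
        (1, None, None),
        (None, 2, None),
        (3, None, 3),
        (None, None, None),
    ],
    schema='col_1 int, col_2 int, col_3 int',
)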

We apply the following coalesce statement to it:


df_simple_coalesce = df.withColumn(
    'coalesced',
    # a simple coalesce
    coalesce(
        col('col_1'),
        col('col_2'),
        col('col_3'),
    )
).withColumn(
    'coalesced_with_default',
    # note the use of an ending literal to ensure no nulls
    coalesce(
        col('col_1'),
        col('col_2'),
        col('col_3'),
        lit(5),
    )
)

This gives us the following:
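
With the illustrative rows sketched above, df_simple_coalesce.show() prints roughly the following (null rendering varies by Spark version):

+-----+-----+-----+---------+----------------------+
|col_1|col_2|col_3|coalesced|coalesced_with_default|
+-----+-----+-----+---------+----------------------+
|    1| null| null|        1|                     1|
| null|    2| null|        2|                     2|
|    3| null|    3|        3|                     3|
| null| null| null|     null|                     5|
+-----+-----+-----+---------+----------------------+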

OK, so what has happened here? The coalesce function takes the first non-null value from its list of arguments, and withColumn assigns that value to our new column. We can see that the left-most non-null column's value is assigned in each case. Where we added a final literal, it fills the rows in which no non-null value was found; where we did not, the value remains null.

Using When Conditions with Coalesce

Rather than simply coalescing the values, let's use the same input dataframe but get a little more advanced by adding a condition to one of the coalesce terms:

# coalesce statement used in combination with conditional when statement
df_when_coalesce = df.withColumn(
    'coalesced_when', 
    coalesce(
        # no otherwise so nulls preserved
        when(col('col_1') > 1, 5), 
        col('col_2'), 
        col('col_3')
    )
).withColumn(
    'coalesced_when_with_otherwise', 
    # otherwise overwrites all nulls in col_1 
    coalesce(
        when(col('col_1') > 1, 5).otherwise(6), 
        col('col_2'), 
        col('col_3')
    )
)

This now results in:
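
With the illustrative rows sketched above, df_when_coalesce.show() prints roughly:

+-----+-----+-----+--------------+-----------------------------+
|col_1|col_2|col_3|coalesced_when|coalesced_when_with_otherwise|
+-----+-----+-----+--------------+-----------------------------+
|    1| null| null|          null|                            6|
| null|    2| null|             2|                            6|
|    3| null|    3|             5|                            5|
| null| null| null|          null|                            6|
+-----+-----+-----+--------------+-----------------------------+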

We see that in the column coalesced_when, where the value in col_1 was not null and exceeded 1, the when statement assigned a value of 5; nulls in col_1 fail the condition, so the when term evaluates to null and coalesce moves on to the next column. However, in the column coalesced_when_with_otherwise, the otherwise statement applies a value of 6 not just where col_1 is less than or equal to 1 but also where it is null, because a null comparison also falls through to otherwise. Care is needed when using otherwise inside coalesce to avoid unforeseen effects.
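
If we instead want nulls in col_1 to stay null, one option (a sketch, not from the original post) is to chain a second when rather than using otherwise, since a null value satisfies neither condition:

df_when_guarded = df.withColumn(
    'coalesced_when_guarded',
    coalesce(
        # nulls in col_1 match neither branch, so this term stays null
        # and coalesce falls through to col_2 and col_3
        when(col('col_1') > 1, 5).when(col('col_1') <= 1, 6),
        col('col_2'),
        col('col_3'),
    )
)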

Generating When Conditions with a List Comprehension

Occasionally we may want to generate a whole series of conditions from a dictionary or a list. This is useful where we want to adjust our logic at runtime by passing a custom dictionary or list into the coalesce statement.

Let's consider a simple case to illustrate:

# example of using a list comprehension to compile whens
columns = [1, 2, 3]
whens = [
    when(
        col(f'col_{c}') == c, 0
    ).otherwise(
        col(f'col_{c}')
    ) for c in columns
]
# we can also add a catchall condition to the list
whens += [lit(-1)]

df_multiple_whens = df.withColumn(
    'coalesced_whens_from_list',
    coalesce(
        # use star syntax to apply our list of conditions in order
        *whens
    )
)

On the same dataframe as before we get:
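
With the illustrative rows sketched above, df_multiple_whens.show() prints roughly:

+-----+-----+-----+-------------------------+
|col_1|col_2|col_3|coalesced_whens_from_list|
+-----+-----+-----+-------------------------+
|    1| null| null|                        0|
| null|    2| null|                        0|
|    3| null|    3|                        3|
| null| null| null|                       -1|
+-----+-----+-----+-------------------------+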

So what is happening here? We set up a list of conditions that assign a value of zero if the first column holding a non-null value has a value equal to its column number. If no non-null value is found at all, the final lit(-1) catch-all assigns -1. In all other cases we take the first valid column value.

Note that we can define both the columns to be handled and the conditions to be tested via the comprehension. This gives us considerable flexibility and power.
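
As a sketch of the dictionary-driven variant mentioned earlier (the rules mapping and names here are hypothetical), we can map each column name to the value that should collapse to zero:

# hypothetical runtime configuration: column name -> value mapped to zero
rules = {'col_1': 1, 'col_2': 2, 'col_3': 3}
whens_from_dict = [
    when(col(name) == target, 0).otherwise(col(name))
    for name, target in rules.items()
]
# catch-all default, as before
whens_from_dict.append(lit(-1))

df_whens_from_dict = df.withColumn(
    'coalesced_whens_from_dict',
    coalesce(*whens_from_dict)
)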

An example workbook demonstrating this can be found on Databricks Community Edition or downloaded from my GitHub.
