Retrieving the Index of PySpark Array Elements when Exploding

Exploding arrays is often very useful in PySpark. However, because row order is not guaranteed in PySpark DataFrames, it is extremely useful to be able to obtain the index of each exploded element as well as the element itself.

The good news is that two functions exist to do just that: posexplode and posexplode_outer. These are the counterparts of explode and explode_outer: posexplode skips rows whose array is null or empty, while posexplode_outer keeps those rows, emitting null for both the position and the value.
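For the examples below, a minimal setup sketch is assumed; the column names row and values match the snippets that follow, while the session name and sample data are purely illustrative.

import pyspark.sql.functions as fn
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("posexplode-demo").getOrCreate()

# A toy DataFrame with a normal array, an empty array and a null array,
# so the difference between posexplode and posexplode_outer is visible.
df = spark.createDataFrame(
    [
        (1, ["a", "b", "c"]),
        (2, []),
        (3, None),
    ],
    "row INT, values ARRAY<STRING>",
)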

posexplode

The posexplode function is the counterpart of explode: like explode, it skips rows whose array is null or empty, but it also returns each element's position. It's worth noting that posexplode must be used inside a select, since withColumn adds only one column at a time and cannot handle the two columns of output (position and value) that posexplode produces. Let's compare the code needed for an explode and a posexplode.

# if this was how you wanted to explode a column
df_explode = df.withColumn(
    'exploded', fn.explode('values')
)

# then this is how you would posexplode it
df_posexplode = df.select(
    'row', 'values', fn.posexplode('values')
)
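With the sample DataFrame above, the posexplode version appends two columns, pos (the element's index) and col (the element), and drops the rows whose array is null or empty.

# Quick check of the result (rows 2 and 3 disappear entirely)
df_posexplode.show()
# row=1 explodes to (pos, col) = (0, a), (1, b), (2, c)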

posexplode_outer

The posexplode_outer function is the counterpart of explode_outer: it keeps rows whose array is null or empty, emitting null for both the position and the value. Once again, posexplode_outer must be used inside a select, since withColumn adds only one column at a time and cannot handle the two columns of output. Thus the code we need for a posexplode_outer is as follows:

df_posexplode_outer = df.select(
    'row', 'values', fn.posexplode_outer('values')
)
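Run against the same sample DataFrame, posexplode_outer keeps the rows with the empty and null arrays, emitting null for both pos and col.

# Quick check of the result (rows 2 and 3 are retained)
df_posexplode_outer.show()
# row=1 still explodes to (0, a), (1, b), (2, c);
# rows 2 and 3 each appear once with pos=null and col=null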

Renaming the output columns

By default the output columns are called ‘pos’ and ‘col’; if you want to rename them you can use alias. You may only be familiar with aliasing single columns, but alias takes *args and matches them to the number of columns being aliased. Thus the way to rename your two new columns is as follows:

df_posexplode_outer = df.select(
    'row', 
    'values', 
    fn.posexplode_outer('values').alias("index", "column")
)
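A quick sanity check on the sketch above confirms the renamed columns:

print(df_posexplode_outer.columns)
# ['row', 'values', 'index', 'column']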

Important note about indexing

It is important to remember that posexplode and posexplode_outer index from zero in a Pythonic fashion. This is different from some other PySpark functions, such as substring, which index from one. An example workbook can be found here or here to see what the code samples above give as output.
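As a small, self-contained illustration of the difference (the DataFrame here is hypothetical), the first array element gets pos 0 from posexplode, while substring(column, 1, 1) returns the first character:

demo = spark.createDataFrame(
    [("abc", ["a", "b", "c"])],
    "text STRING, letters ARRAY<STRING>",
)

demo.select(
    fn.substring("text", 1, 1).alias("first_char"),  # substring is 1-based: returns 'a'
    fn.posexplode("letters")                         # first element has pos = 0
).show()
# first_char = 'a' in every row; (pos, col) = (0, a), (1, b), (2, c)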
