If you use PySpark you are likely aware that, as well as being able to group by and count elements, you can also group by and count distinct elements:
df_group_count = df.groupBy('group').agg(fn.count('values'))

df_group_count_distinct = df.groupBy('group').agg(fn.count_distinct('values'))
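If you want something to run these snippets against, a minimal setup could look like the sketch below. The SparkSession boilerplate, the fn / Window import aliases, and the sample data are my own additions for illustration; only the column names 'group' and 'values' come from the snippets in this post.

from pyspark.sql import SparkSession, functions as fn
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# small made-up dataset: three rows per group, with some duplicate values
df = spark.createDataFrame(
    [('a', 1), ('a', 1), ('a', 2), ('b', 3), ('b', 4), ('b', 4)],
    ['group', 'values'],
)

# with this data, count gives 3 per group and count_distinct gives 2 per group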
However, the picture is a little different when using window functions. Sure, you can count as part of a window function; that works like this:
window = Window.partitionBy('group')

df_window_count = df.withColumn(
    'count',
    fn.count('values').over(window)
)
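One difference from the groupBy version worth spelling out: the windowed count does not collapse rows. Every input row is kept and the per-group total is repeated on each of them. With the assumed sample data from earlier, the output would be along these lines (row order may differ):

df_window_count.show()
# +-----+------+-----+
# |group|values|count|
# +-----+------+-----+
# |    a|     1|    3|
# |    a|     1|    3|
# |    a|     2|    3|
# |    b|     3|    3|
# |    b|     4|    3|
# |    b|     4|    3|
# +-----+------+-----+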
However, when you try to count distinct, you will discover that an equivalent function does not exist for window functions. Trying this results in an error:
window = Window.partitionBy('group')

df_window_count = df.withColumn(
    # note this leads to an error
    'count',
    fn.count_distinct('values').over(window)
)

### AnalysisException: Distinct window functions are not supported
Not to worry though, there are a couple of functions we can combine to solve the problem: collect_set and size. We can collect the set of elements from each window, which eliminates duplicates, and then take the size of that set to obtain the value the missing windowed count_distinct would have given us. The code looks like this:
window = Window.partitionBy('group')

df_window_count = df.withColumn(
    'count',
    fn.size(fn.collect_set('values').over(window))
)
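With the assumed sample data from earlier, this puts the distinct count per group on every row, which is exactly what a windowed count_distinct would have given (row order may differ):

df_window_count.show()
# +-----+------+-----+
# |group|values|count|
# +-----+------+-----+
# |    a|     1|    2|
# |    a|     1|    2|
# |    a|     2|    2|
# |    b|     3|    2|
# |    b|     4|    2|
# |    b|     4|    2|
# +-----+------+-----+

One thing to be aware of: collect_set materialises the full set of distinct values for each window, so on very high-cardinality columns it can be heavier than a plain aggregate. Like count_distinct, it ignores nulls, so the two approaches should agree on the numbers.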
So there you go, a way to count distinct values within a window. Example code in a notebook can be found here and here.
“Ah, PySpark – where the simple task of counting distinct values becomes an accidental adventure! 🎢 But hey, at least we get to flex those creative SQL muscles!”