If you use PySpark you are likely aware that, as well as being able to group by and count elements, you can also group by and count distinct elements:
df_group_count = df.groupBy('group').agg(fn.count('values'))

df_group_count_distinct = df.groupBy('group').agg(fn.count_distinct('values'))
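If you want something to run these snippets against, a minimal setup could look like the sketch below. The SparkSession boilerplate, the fn / Window import aliases, and the sample data are my own additions for illustration; only the column names 'group' and 'values' come from the snippets in this post.

from pyspark.sql import SparkSession, functions as fn
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# small made-up dataset: three rows per group, with some duplicate values
df = spark.createDataFrame(
    [('a', 1), ('a', 1), ('a', 2), ('b', 3), ('b', 4), ('b', 4)],
    ['group', 'values'],
)

# with this data, count gives 3 per group and count_distinct gives 2 per group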
However, the picture is a little different when using window functions. Sure, you can count as part of a window function; that works like this:
window = Window.partitionBy('group')

df_window_count = df.withColumn(
    'count',
    fn.count('values').over(window)
)
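One difference from the groupBy version worth spelling out: the windowed count does not collapse rows. Every input row is kept and the per-group total is repeated on each of them. With the assumed sample data from earlier, the output would be along these lines (row order may differ):

df_window_count.show()
# +-----+------+-----+
# |group|values|count|
# +-----+------+-----+
# |    a|     1|    3|
# |    a|     1|    3|
# |    a|     2|    3|
# |    b|     3|    3|
# |    b|     4|    3|
# |    b|     4|    3|
# +-----+------+-----+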
However, when you try to count distinct, you will discover that an equivalent function does not exist for window functions. Trying this results in an error:
window = Window.partitionBy('group')

df_window_count = df.withColumn(
    # note this leads to an error
    'count',
    fn.count_distinct('values').over(window)
)

### AnalysisException: Distinct window functions are not supported
Not to worry though, there are a couple of functions we can combine to solve the problem: collect_set and size. We can collect the set of elements from each window, which eliminates duplicates, and then take the size of that set to obtain the value the missing windowed count_distinct would have given us. The code looks like this:
window = Window.partitionBy('group')

df_window_count = df.withColumn(
    'count',
    fn.size(fn.collect_set('values').over(window))
)
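With the assumed sample data from earlier, this puts the distinct count per group on every row, which is exactly what a windowed count_distinct would have given (row order may differ):

df_window_count.show()
# +-----+------+-----+
# |group|values|count|
# +-----+------+-----+
# |    a|     1|    2|
# |    a|     1|    2|
# |    a|     2|    2|
# |    b|     3|    2|
# |    b|     4|    2|
# |    b|     4|    2|
# +-----+------+-----+

One thing to be aware of: collect_set materialises the full set of distinct values for each window, so on very high-cardinality columns it can be heavier than a plain aggregate. Like count_distinct, it ignores nulls, so the two approaches should agree on the numbers.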
So there you go, a way to count distinct values within a window. Example code in a notebook can be found here and here.
“Ah, PySpark – where the simple task of counting distinct values becomes an accidental adventure! 🎢 But hey, at least we get to flex those creative SQL muscles!”