Using the PySpark @udf decorator with Currying

While PySpark has a broad range of excellent data manipulation functions, on occasion you might want to create a custom function of your own. These are called User Defined Functions, or UDFs, and I have written about them before. As well as the standard ways of using UDFs covered previously, PySpark also has an @udf decorator. Decorators have some distinct advantages for code readability and compactness. However, we might want to add in variables to create a more flexible UDF. Can we still do this with decorators?

Creating UDFs

Let's start by remembering the standard way that we might create a UDF. Let's assume we have a dataframe df with a column of integers called “integers”. We could do the following.


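from pyspark.sql import functions as fn
from pyspark.sql import types as T
# udf is imported directly so it can be used as the @udf decorator later on
from pyspark.sql.functions import udf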

def multiply_by_two(number):
    return number * 2

udf_multiply_by_two = fn.udf(multiply_by_two, T.IntegerType())

display(df.withColumn(
    "times_two", udf_multiply_by_two("integers")
))

Here we create a function `multiply_by_two` (it's trivial, but this is just to demonstrate). Then we use fn.udf to convert this into a User Defined Function which can be applied to a column in our dataframe.

Using the @udf Decorator

That is great, but we can do this more compactly by using the @udf decorator:


@udf(T.IntegerType())
def decorator_multiply_by_three(number):
    return number * 3

display(df.withColumn(
    "decorator_times_three", decorator_multiply_by_three("integers")
))

Here fn.udf is applied as a decorator, which saves us having to create a second function from our desired function.
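The decorator form is only syntactic sugar for calling a function that returns a decorator and then applying the result. The sketch below shows that equivalence in plain Python, with no Spark required. Note that `tagged` is a hypothetical stand-in with the same call shape as `udf(returnType)`; it is not part of PySpark and does not build a real UDF.

```python
def tagged(return_type):
    """Hypothetical stand-in for a decorator factory such as udf(returnType)."""
    def decorator(func):
        # Record the declared return type on the function, then hand it back.
        func.return_type = return_type
        return func
    return decorator


# Decorator syntax ...
@tagged("int")
def times_three(number):
    return number * 3


# ... does the same job as calling the factory and applying the result by hand.
def times_three_plain(number):
    return number * 3

times_three_manual = tagged("int")(times_three_plain)

print(times_three(2), times_three.return_type)
print(times_three_manual(2), times_three_manual.return_type)
```

Both routes end with a single name bound to the finished function; the decorator just skips the intermediate one.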

Currying in Additional Parameters

This seems handy, but can we still easily introduce additional parameters not contained in our dataframe by currying? If you are not familiar with this concept you may want to look here. The good news is, yes we can. As a side note, I have heard this method referred to as closure as well. Closure is likely more technically correct. However, ‘closuring in a variable’ does not really give the same sense of what is happening as ‘currying in a variable’. First, remember how we normally curry information into the function we want to use as a UDF. We do this by adding a wrapper function as seen here:


def curry_multiply_by_n(n):
    def multiply_by_n(number):
        return number * n
    return multiply_by_n

multiply_by_four = curry_multiply_by_n(4)

udf_multiply_by_four = fn.udf(multiply_by_four, T.IntegerType())

display(df.withColumn(
    "decorator_times_four", udf_multiply_by_four("integers")
))
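To see why ‘closure’ is arguably the more accurate term, it can help to look at where the value of n actually ends up. This is a minimal plain-Python sketch, without Spark, that reuses the wrapper from above; the `multiply` helper and the use of `functools.partial` are my own additions for comparison and are not something the workbook relies on.

```python
from functools import partial


def curry_multiply_by_n(n):
    def multiply_by_n(number):
        return number * n
    return multiply_by_n


multiply_by_four = curry_multiply_by_n(4)

# The inner function carries n with it in a closure cell.
captured = multiply_by_four.__closure__[0].cell_contents
print(captured)              # 4
print(multiply_by_four(10))  # 40


# functools.partial is another standard way to pre-fill an argument.
def multiply(n, number):
    return number * n

partial_times_four = partial(multiply, 4)
print(partial_times_four(10))  # 40
```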

We can replace the step where we create a UDF with the @udf decorator like this:

def curry_value(num_times):
    @udf(T.IntegerType())
    def decorator_multiply_by_n(number):
        return number * num_times
    return decorator_multiply_by_n

curried_decorator = curry_value(5)

display(df.withColumn(
    "decorators_times_five", curried_decorator("integers")
))

Better still, we can take advantage of the fact that functions are objects by using double bracket notation to hand our decorated function its parameters when it is run, rather than creating a separate function. This can be done like this:

def curried_times_n(num_times):
    @udf(T.IntegerType())
    def decorator_multiply_by_n(number):
        return number * num_times
    return decorator_multiply_by_n

# note the double brackets after curried_times_n
# allowing parameterisation when the function is called
display(df.withColumn(
    "decorators_times_six", curried_times_n(6)("integers")
))
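The double bracket form is simply two calls chained together: the first builds the function, the second calls it. Here is a small plain-Python sketch of that, leaving out the @udf step so it runs without Spark; `times_n` is a hypothetical stand-in for `curried_times_n`.

```python
def times_n(num_times):
    def multiply(number):
        return number * num_times
    return multiply


# Two-step form: build the function, then call it.
times_six = times_n(6)
two_step = times_six(7)

# Double bracket form: build and call in one expression.
one_step = times_n(6)(7)

# Each call to times_n builds a fresh function with its own captured value.
multipliers = {n: times_n(n) for n in (2, 3, 4)}
results = {n: f(10) for n, f in multipliers.items()}

print(two_step, one_step)  # 42 42
print(results)             # {2: 20, 3: 30, 4: 40}
```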

Not so Useful! Making a Currying Decorator

The next question is: can we make the currying wrapper into a decorator as well? The answer is yes, but this is less helpful than you might think. Here is one way to do it:

import functools
def repeat(num_times):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_repeat(*args, **kwargs):
            # note multiplication is done here now
            value = func(*args, **kwargs) * num_times
            return value
        return wrapper_repeat
    return decorator_repeat

@repeat(7)
@udf(T.IntegerType())
def decorator_multiply_by_n(number):
    # note that we just return number here since
    # the wrapper does the multiplication
    return number

display(df.withColumn(
    "decorators_times_seven", decorator_multiply_by_n("integers")
))

Seems cool, right? Well, not really. The problem is that we have lost the ability to use double bracket notation to introduce the curried parameter at runtime. Also, we have made the function considerably more cryptic because much of its functionality now needs to be inside the decorator function. Overall this is likely not an improvement, and we should keep things simple as in the example above.
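The loss of double bracket notation can be reproduced in plain Python. The sketch below uses the same `repeat` decorator but leaves out @udf so it runs without Spark; `identity` is a hypothetical stand-in for the decorated function. In the Spark version the intermediate value would be a Column rather than an int, so I would expect the error message to differ, but the cause is the same: the multiplier is fixed when the function is defined.

```python
import functools


def repeat(num_times):
    def decorator_repeat(func):
        @functools.wraps(func)
        def wrapper_repeat(*args, **kwargs):
            return func(*args, **kwargs) * num_times
        return wrapper_repeat
    return decorator_repeat


@repeat(7)  # the multiplier is fixed here, at definition time
def identity(number):
    return number


print(identity(3))  # 21

# Passing a "multiplier" at call time treats 6 as the data instead,
# and the second pair of brackets then tries to call the resulting integer.
try:
    identity(6)("integers")
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)  # TypeError
```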

If you want to see a runnable example of this (including a demonstration of how the decorator method fails for double bracket notation), you can take a look at a demonstration workbook on Databricks Community Edition (for the next 6 months or so at least).

Update: I have saved a copy of the workbook on GitHub as well.
