PySpark List Matching

There are a variety of ways to filter strings in PySpark, each with its own advantages and disadvantages. This post will consider three of the most useful. It will also show how one of them can be leveraged to provide the best features of the other two. An accompanying workbook can be found on Databricks Community Edition.

contains

The contains function allows you to match strings or substrings within a DataFrame column as part of a filter. The general syntax is as follows:


import pyspark.sql.functions as fn

display(df.filter(fn.col("col_1").contains("searchstring")))

isin

The isin function allows you to match a list against a column. If any item in the list matches the column value, it returns true. However, unlike contains, the match must be exact: substrings are not matched.


display(df.filter(fn.col("col_1").isin(["search", "string"])))

rlike

The rlike function is the most powerful of the three: it allows you to match any regular expression (regex) against the contents of a column. Remember that you may need to escape regex special characters in certain cases. The only disadvantage of rlike is that, because it relies on regular expressions, its matching syntax can become a little cryptic in complex situations.

display(df.filter(fn.col("col_1").rlike("searchstring")))

Combining Functionality

The question arises: what if we want to combine the power of these different functions? For instance, what if we want to match a list of substrings? The naive approach might be to iterate over the separate substrings using contains and then union the results:

items = ["search", "string"]
df_out = None
for item in items:
  if df_out is None:
    df_out = df.filter(fn.col("col_1").contains(item))
  else:
    df_out = df_out.union(df.filter(fn.col("col_1").contains(item)))
display(df_out)

However this approach has several flaws:

  1. It uses a loop which reduces PySpark’s ability to parallelise the work
  2. It evaluates against every substring whether it needs to or not
  3. It will duplicate rows which match more than one of the given criteria unless further code is introduced

A much better approach is to make use of the power of the rlike function. Note that the pipe character | is a logical or in regular expressions:

items = ["search", "string"]
# concatenate the list into a regex string
regex_string = "|".join(items)
# use rlike to do the comparison
df_out = df.filter(fn.col("col_1").rlike(regex_string))
display(df_out)

This overcomes the problems with the previous method:

  1. The loop is removed; rlike evaluates our whole list against each row in a single pass, allowing Spark to parallelise the work
  2. Evaluation short-circuits: the regex alternation stops at the first substring that matches
  3. Rows matching more than one substring are returned only once, without the need for further checks

If we suspect that our list may contain strings with regex special characters, this can easily be handled as follows:

import re
items = ["www.google", "://"]
# convert the list to a list of regex safe escaped strings
escaped_items = [re.escape(x) for x in items]
# concatenate the list into a regex string
regex_string = "|".join(escaped_items)
# use rlike to do the comparison
df_out = df.filter(fn.col("col_2").rlike(regex_string))
display(df_out)

Better still, by suitable modification of our concatenated regex, we could search not just for any substring, but for prefixes, suffixes or other special cases as desired. For more information on how to use regular expressions, a good place to start is https://www.regular-expressions.info/quickstart.html. A good regex tester can be found at https://regexr.com/
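For instance, to match prefixes only, each escaped item can be anchored with ^ before joining (a plain-Python sketch with hypothetical URL prefixes; the resulting pattern string can be passed to rlike in exactly the same way):

```python
import re

items = ["http://", "https://"]
# Escape each item, then anchor it to the start of the string
prefix_regex = "|".join("^" + re.escape(item) for item in items)

# The pattern matches values that start with either prefix...
print(bool(re.search(prefix_regex, "https://example.com")))      # True
# ...but not values that merely contain one
print(bool(re.search(prefix_regex, "see https://example.com")))  # False
```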
