There are a variety of ways to filter strings in PySpark, each with its own advantages and disadvantages. This post will consider three of the most useful. It will also show how one of them can be leveraged to provide the best features of the other two. An accompanying workbook can be found on Databricks community edition.
contains
The contains function allows you to match a string or substring within a DataFrame column as part of a filter. Throughout these examples, fn is assumed to be pyspark.sql.functions (imported as import pyspark.sql.functions as fn) and df is the DataFrame being filtered. The general syntax is as follows:
display(df.filter(fn.col("col_1").contains("searchstring")))
isin
The isin function allows you to match a list against a column. It returns true for a row if any of the list contents equals that row's value. However, unlike contains, the match must be exact; substrings are not matched.
display(df.filter(fn.col("col_1").isin(["search", "string"])))
rlike
The rlike function is the most powerful of the three: it allows you to match any regular expression (regex) against the contents of a column. Remember that you may need to escape regex special characters in certain cases. The only disadvantage of rlike is that, because it relies on regular expressions, its matching syntax can become a little cryptic in complex situations.
display(df.filter(fn.col("col_1").rlike("searchstring")))
Combining Functionality
The question arises: what if we want to combine the power of these different functions? For instance, what if we want to match a list of substrings? The naive approach might be to iterate over the separate substrings using contains and then union the results:
items = ["search", "string"]

df_out = None
for item in items:
    if df_out is None:
        df_out = df.filter(fn.col("col_1").contains(item))
    else:
        df_out = df_out.union(df.filter(fn.col("col_1").contains(item)))

display(df_out)
However this approach has several flaws:
- It uses a loop which reduces PySpark’s ability to parallelise the work
- It evaluates against every substring whether it needs to or not
- It will duplicate lines which match more than one of the given criteria unless further code is introduced
A much better approach is to make use of the power of the rlike function. Note that the pipe character | is a logical OR in regular expressions:
items = ["search", "string"]

# concatenate the list into a regex string
regex_string = "|".join(items)

# use rlike to do the comparison
df_out = df.filter(fn.col("col_1").rlike(regex_string))

display(df_out)
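For simple alternation patterns like this, rlike's Java regex syntax behaves the same as Python's re module, so the construction can be sketched and checked locally without a Spark cluster (the sample strings below are illustrative):

```python
import re

items = ["search", "string"]

# joining with | produces a pattern matching either substring
regex_string = "|".join(items)  # "search|string"

# re.search, like rlike, matches anywhere within the string
print(bool(re.search(regex_string, "my search term")))    # True
print(bool(re.search(regex_string, "a string of text")))  # True
print(bool(re.search(regex_string, "no match here")))     # False
```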
This overcomes the problems with the previous method:
- The loop is removed; rlike evaluates our whole list against each row in a single pass
- Evaluation is efficient: the regex engine stops at the first matching alternative, so unnecessary substrings are not checked
- Incorrect duplicates are avoided without the need for further checks
If we suspect that the strings in our list may contain regex special characters, this can easily be handled as follows:
import re

items = ["www.google", "://"]

# convert the list to a list of regex-safe escaped strings
escaped_items = [re.escape(x) for x in items]

# concatenate the list into a regex string
regex_string = "|".join(escaped_items)

# use rlike to do the comparison
df_out = df.filter(fn.col("col_2").rlike(regex_string))

display(df_out)
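The effect of re.escape can be seen locally: unescaped, the . in "www.google" is a regex wildcard matching any character, whereas the escaped pattern matches only a literal dot (sample strings here are illustrative):

```python
import re

# unescaped: "." matches any character, so "wwwXgoogle" matches too
print(bool(re.search("www.google", "wwwXgoogle")))        # True

# escaped: "." becomes "\." and matches only a literal dot
escaped = re.escape("www.google")
print(bool(re.search(escaped, "wwwXgoogle")))             # False
print(bool(re.search(escaped, "http://www.google.com")))  # True
```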
Better still, by suitable modification of the regex, we can use such a concatenated expression to search not just for arbitrary substrings but for prefixes, suffixes or other special cases as desired. For more information on regular expressions, a good place to start is https://www.regular-expressions.info/quickstart.html, and a good regex tester can be found at https://regexr.com/
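As a sketch of one such special case, wrapping the escaped alternation in a group anchored with ^ restricts the match to prefixes; the resulting pattern string would work equally well with rlike (the substrings here are illustrative):

```python
import re

items = ["www.", "http://"]

# escape each substring, then anchor the grouped alternation at the start
regex_string = "^(" + "|".join(re.escape(x) for x in items) + ")"

print(bool(re.search(regex_string, "www.example.com")))  # True
print(bool(re.search(regex_string, "http://example")))   # True
print(bool(re.search(regex_string, "see www.example")))  # False (not a prefix)
```

Anchoring with $ at the end instead would match suffixes in the same way.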