StackOverflow is a wonderful source of solutions to common yet tricky programming issues. However, there are a few things to be aware of when referring to it for PySpark. This article discusses those pitfalls and also points out a few commonly useful StackOverflow answers for PySpark.
Things to Bear in Mind
The accepted or most upvoted solution on StackOverflow is not always the best. This is partly because PySpark is still under active development, so the best approach may change over time, and partly because the best answer may be situational.
Sometimes the only solution you can find to a Spark problem on StackOverflow is in Scala or Java rather than Python. The Spark community is fairly small, so question coverage is not complete. The good news is that Scala solutions tend to be similar in syntax to their Python equivalents, making translation easy. Even Java answers can often give you strong clues about the approach, if not the syntax.
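For example, a Scala answer along the lines of df.filter(col("age") > 21).select("name") carries over almost token for token. A minimal sketch of the translation, with purely illustrative column names:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 17)], ["name", "age"])

# Scala: df.filter(col("age") > 21).select("name")
# PySpark: the same call chain, just in Python syntax
df.filter(fn.col("age") > 21).select("name").show()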
Some elements of PySpark are hard to search for because they depend on symbols. These include negating filter and when conditions, where the tilde symbol ~ should be used, and the * syntax for accessing the contents of struct columns. In these cases the SymbolHound search engine may be your friend. Otherwise you may need to rephrase your question.
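As a quick illustration of both symbols, here is a minimal sketch (with made-up column names) showing ~ negating a filter condition and .* flattening a struct column:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", ("x", 1)), ("b", ("y", 2))], ["id", "payload"])

# ~ negates a boolean Column expression: keep rows where id is not "a"
df.filter(~(fn.col("id") == "a")).show()

# payload.* expands the struct's fields into top-level columns
df.select("id", "payload.*").show()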
There are inconsistencies around import conventions for common functions like pyspark.sql.functions.col. Personally I like to use from pyspark.sql import functions as fn, types as T since this seems to be the most commonly used import convention, but functions as F or importing individual functions is also not uncommon. Just bear this in mind when searching.
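In practice it just means mapping between whatever alias an answer uses and your own. A trivial sketch of the conventions you are most likely to meet:

# Aliased module imports (the convention used in this article)
from pyspark.sql import functions as fn, types as T

# The other alias you will see constantly on StackOverflow
from pyspark.sql import functions as F

# Importing individual names also works, though beware of shadowing builtins like sum
from pyspark.sql.functions import col, when

fn.col("x"), F.col("x"), col("x")  # all build the same Column expression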
Finally, if you can’t find an answer, perhaps you have a novel question and could submit one of your own. Be sure to follow the best practice for asking questions on StackOverflow.
Useful Solutions
Here are some solutions I seem to keep using or referring other people to:
Summing a column to a python variable
Let's start with one of those little tricks you find yourself using all the time. The code required is
python_variable = df.groupBy().sum().collect()[0][0]
The trick is to collect and then reach into the row object returned to extract the value by indexing. Credit where credit is due, this is where I first saw how to do it
You can also extract an entire column to a python list
https://stackoverflow.com/questions/38610559/convert-spark-dataframe-column-to-python-list
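A minimal sketch of that pattern, assuming a DataFrame with a single column named value (the name is purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

# collect() returns Row objects; pull the field out of each one
values = [row["value"] for row in df.select("value").collect()]
print(values)  # [1, 2, 3]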
Handle importing data with unpredictable schema
Sometimes you cannot be sure in advance of the schema of the data you are importing. Or it may be mismatched between source files. No problem, this solution has you covered
https://stackoverflow.com/questions/39083873/spark-2-0-0-reading-json-data-with-variable-schema
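I won't reproduce the answer here, but one robust approach (whether or not it matches the linked answer exactly) is to supply an explicit schema rather than relying on inference, so any field missing from a record simply comes back as null. A rough sketch under that assumption, with invented field names and path:

from pyspark.sql import SparkSession, types as T

spark = SparkSession.builder.getOrCreate()

# Declare the superset of fields you expect; absent fields become null
schema = T.StructType([
    T.StructField("id", T.StringType(), True),
    T.StructField("score", T.DoubleType(), True),
    T.StructField("comment", T.StringType(), True),
])

df = spark.read.json("path/to/json/files", schema=schema)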
Counting nulls by a grouping condition
Sometimes you may want to count the number of nulls present in groups defined by a column. This neat answer has you covered
https://stackoverflow.com/questions/55265954/pyspark-dataframe-groupby-and-count-null-values
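One way to do it, broadly in the spirit of that answer, is to turn the null check into a 0/1 flag and sum it within each group. A hedged sketch with made-up column names:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", None), ("b", None), ("b", 2)], ["group", "value"])

# Convert the null check to 1/0, then sum within each group
df.groupBy("group").agg(
    fn.sum(fn.when(fn.col("value").isNull(), 1).otherwise(0)).alias("null_count")
).show()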
Cast structured columns to JSON strings
There can be a number of reasons to want to do this, most notably if you want to group or order by a map column, or if you want to export to a flat structure like a CSV. The simple answer is that you need to use to_json, but here is an interesting helper function (which can easily have MapType columns added).
https://stackoverflow.com/questions/41730369/spark-cast-structtype-json-to-string
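At its simplest that looks like the following sketch, where the struct is built inline purely for illustration:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "label"])

# Bundle some columns into a struct, then serialise it to a JSON string
df = df.withColumn("payload", fn.struct("id", "label"))
df = df.withColumn("payload_json", fn.to_json(fn.col("payload")))
df.show(truncate=False)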
Extracting from map columns
We can extract keys and values from map type columns. The following StackOverflow answer details how, though the documentation reference has now moved.
For the documentation on extracting keys using map_keys see here; for getting values with map_values check here. Note that both these methods extract in an unordered fashion, so don’t expect that the two lists must necessarily match up.
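A minimal sketch of both functions, with a map column built inline for illustration:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([({"a": 1, "b": 2},)], ["m"])

# map_keys and map_values each return an array column
df.select(fn.map_keys("m").alias("keys"), fn.map_values("m").alias("values")).show()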
Another neat trick is the following answer on how to extract a specific key's value from an array of maps into a new array column
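One way to achieve that (which may or may not match the linked answer exactly) is a higher-order transform over the array, available as a SQL expression from Spark 2.4. A sketch with an invented key and column names:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([{"k": "a"}, {"k": "b"}],)], ["maps"])

# Pull the value of key 'k' out of every map in the array
df = df.withColumn("k_values", fn.expr("transform(maps, m -> m['k'])"))
df.show(truncate=False)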
Extract a column of lists of tuples to separate lists in separate columns
A little similar to the key value extraction above but for a different situation
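As a rough sketch of the situation, tuples arrive in Spark as structs, and selecting a field directly on an array of structs gives you an array of just that field (the field names here are the defaults Spark generates):

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([("a", 1), ("b", 2)],)], ["pairs"])

# Selecting a struct field across the array yields one array per field
df.select(fn.col("pairs._1").alias("firsts"), fn.col("pairs._2").alias("seconds")).show()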
Dropping and adding nested columns
Here is a neat approach to dropping specific nested columns
https://stackoverflow.com/questions/45061190/dropping-nested-column-of-dataframe-with-pyspark
While here is how to add a column to a nested struct
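If you are on Spark 3.1 or later, there are now built-in Column methods covering both cases, which may be simpler than the approaches in the older answers. A minimal sketch with invented struct and field names:

from pyspark.sql import SparkSession, functions as fn

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ("x", 2.0))], ["id", "meta"])

# Spark 3.1+: drop a field from a struct column in place
df = df.withColumn("meta", fn.col("meta").dropFields("_2"))

# Spark 3.1+: add (or replace) a field inside a struct column
df = df.withColumn("meta", fn.col("meta").withField("source", fn.lit("manual")))
df.printSchema()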