Pandas User Defined Functions for PySpark

March 7, 2022 (updated May 1, 2022) by justinmatters

Today I want to take a look at a neat feature of PySpark called Pandas User Defined Functions. As the name suggests, a Pandas UDF…

Data Science

Using StackOverflow to Solve Common PySpark Issues

December 28, 2021 (updated January 16, 2022) by justinmatters

StackOverflow is a wonderful source of solutions to common yet tricky programming issues. However, there are certainly a few things to be aware of when…

Data Science

A Short Snippet for Converting PySpark Schema

November 7, 2021 (updated April 11, 2022) by justinmatters

PySpark schemas can be laborious to write. One approach to this issue was discussed previously. However, DDL definitions may not meet all needs. Particularly where…

Data Science

Using the PySpark @udf decorator with Currying

September 26, 2021 (updated May 8, 2022) by justinmatters

While PySpark has a broad range of excellent data manipulation functions, on occasion you might want to create a custom function of your own. These…

Data Science

Defending Against Bad Data in Data Pipelines with Property Based Assertions

July 31, 2021 by justinmatters

What's worse than having a data pipeline fail? Having a data pipeline apparently succeed but actually fail! While having a pipeline fail is bad, having…

Data Science

Two Handy PySpark Hints

June 12, 2021 (updated July 31, 2021) by justinmatters

A couple of quick handy hints people may find useful. Checking whether a Delta file exists: the Delta format is an excellent way to store…

Data Science

Easier Way to Define Schema for PySpark

February 27, 2021 (updated March 14, 2021) by justinmatters

If you have ever had to define a schema for a PySpark dataframe, you will know it is something of a rigmarole. Sometimes we can…

Data Science

Starting with Great Expectations in Pyspark Notebooks

January 17, 2021 (updated January 24, 2021) by justinmatters

Great Expectations (GE) is a pipeline testing tool for use with Pandas and Spark dataframes. It is useful for checking that data flowing through your pipeline…

Data Science

Two handy hints for dealing with messy input data

December 28, 2020 (updated December 29, 2020) by justinmatters

Data coming in from external sources can be rather messy. In this article, we will look at a couple of handy hints to deal with…

Computer Graphics

Producing Heatmaps from PySpark

September 20, 2020 (updated October 18, 2020) by justinmatters

Recently I needed to plot some geographic data I had been working on in PySpark on Databricks. I did a bit of research and found…