Defending Against Bad Data in Data Pipelines with Property Based Assertions

July 31, 2021 by justinmatters

What's worse than having a data pipeline fail? Having a data pipeline apparently succeed but actually fail! While having a pipeline fail is bad, having…

Data Science

Two Handy PySpark Hints

June 12, 2021 (updated July 31, 2021) by justinmatters

A couple of quick handy hints people may find useful. Checking whether a Delta file exists: The Delta format is an excellent way to store…

Data Science

Parameterising Azure Functions and Handling Secrets

April 3, 2021 by justinmatters

Now that we have covered getting basic Python functions to work in Azure Functions, let's see about making them more useful. Two things it would…

Data Science

Easier Way to Define Schema for PySpark

February 27, 2021 (updated March 14, 2021) by justinmatters

If you have ever had to define a schema for a PySpark dataframe, you will know it is something of a rigmarole. Sometimes we can…

Data Science

Starting with Great Expectations in PySpark Notebooks

January 17, 2021 (updated January 24, 2021) by justinmatters

Great Expectations (GE) is a pipeline testing tool for use with Pandas and Spark dataframes. It is useful for checking that data flowing through your pipeline…

Data Science

Two handy hints for dealing with messy input data

December 28, 2020 (updated December 29, 2020) by justinmatters

Data coming in from external sources can be rather messy. In this article we will look at a couple of handy hints to deal with…

Data Science

Getting jobs with bespoke libraries to run programmatically on Databricks

October 24, 2020 by justinmatters

Previously I have scheduled PySpark jobs using Airflow, Papermill and the Databricks Jobs API. However, Databricks has an automation system built in for running automated…

Computer Graphics

Producing Heatmaps from PySpark

September 20, 2020 (updated October 18, 2020) by justinmatters

Recently I needed to plot some geographic data I had been working on in PySpark on Databricks. I did a bit of research and found…

Data Science

PySpark Window Functions

August 30, 2020 (updated September 1, 2020) by justinmatters

PySpark window functions are useful when you want to examine relationships within groups of data rather than between groups of data, as with groupBy. To…

Data Science

Checking Dataframe equality in PySpark

May 31, 2020 (updated September 1, 2020) by justinmatters

Recently I needed to check for equality between PySpark dataframes as part of a test suite. To my surprise, I discovered that there is no…