While PySpark has a broad range of excellent data manipulation functions, on occasion you might want to create a custom function of your own. These…
Category: Problem Solving
What's worse than having a data pipeline fail? Having a data pipeline apparently succeed but actually fail! While having a pipeline fail is bad, having…
Azure Functions offer the opportunity to run a wide variety of code in a serverless fashion. Function Apps generally work well for microservices and other…
If you have ever had to define a schema for a PySpark dataframe, you will know it is something of a rigmarole. Sometimes we can…
Data coming in from external sources can be rather messy. In this article we will look at a couple of handy hints to deal with…
Previously I have scheduled PySpark jobs using Airflow, Papermill and the Databricks Jobs API. However, Databricks has an automation system built in for running automated…
Here are a few useful resources, hints and tips for Linux. Mostly these apply to Ubuntu on the command line, and in particular some…
Most companies are aware that their IT solutions contain technical debt. However, it tends not to be a high priority to fix it. There tends…
Recently I needed to check for equality between PySpark dataframes as part of a test suite. To my surprise I discovered that there is no…
Recently I have been involved in organising a free online tabletop games convention called ScaleCon. This came about at very short notice due to…