StackOverflow is a wonderful source of solutions to common yet tricky programming issues. However, there are certainly a few things to be aware of when…
Category: Problem Solving
PySpark schemas can be laborious to write. One approach to this issue was discussed previously. However, DDL definitions may not meet all needs. Particularly where…
While PySpark has a broad range of excellent data manipulation functions, on occasion you might want to create a custom function of your own. These…
What's worse than having a data pipeline fail? Having a data pipeline apparently succeed but actually fail! While having a pipeline fail is bad, having…
Azure Functions offer the opportunity to run a wide variety of code in a serverless fashion. Function Apps generally work well for microservices and other…
If you have ever had to define a schema for a PySpark dataframe, you will know it is something of a rigmarole. Sometimes we can…
Data coming in from external sources can be rather messy. In this article we will look at a couple of handy hints to deal with…
Previously I have scheduled PySpark jobs using Airflow, Papermill and the Databricks Jobs API. However, Databricks has an automation system built in for running automated…
Here are a few useful resources, hints and tips for Linux. Mostly these apply to Ubuntu on the command line, and in particular some…
Most companies are aware that their IT solutions contain technical debt. However, fixing it tends not to be a high priority. There tends…