Working Around Missing Array Functions in PySpark

April 29, 2023 (updated May 28, 2023) by justinmatters

Recently I noticed that the ArrayType in PySpark is missing some useful aggregation functions. Let's suppose you have a data frame created as follows: If…

Data Science

Pandas User Defined Functions for PySpark

March 7, 2022 (updated May 1, 2022) by justinmatters

Today I want to take a look at a neat feature of PySpark called Pandas User Defined Functions. As the name suggests, a Pandas UDF…

Data Science

Starting with Great Expectations in Pyspark Notebooks

January 17, 2021 (updated January 24, 2021) by justinmatters

Great Expectations (GE) is a pipeline testing tool for use with Pandas and Spark dataframes. It is useful for checking that data flowing through your pipeline…

Data Science

Building a Custom Data Pipeline Using Curried Functions

July 22, 2019 (updated May 8, 2022) by justinmatters

If you work in data science you have probably come across the pipeline model for handling data transformations. It is used by many machine learning…

Data Science

Pandas to PySpark Conversion Cheatsheet

June 1, 2019 (updated July 9, 2019) by justinmatters

This is a follow-on from my last post about starting with PySpark and Databricks. Here is a link to a table I have…

Data Science

Using SQLAlchemy to Run SQL Procedures

April 4, 2019 (updated July 7, 2023) by justinmatters

Occasionally you may want to invoke a stored procedure from your Python code in order to manipulate data as part of a larger task. Naively…

Data Science

Using SQLAlchemy to Export Data from Pandas

March 16, 2019 (updated April 4, 2019) by justinmatters

In the last blog post I discussed using SQLAlchemy to import SQL database data into pandas for data analysis. But what if you wish…

Data Science

Using SQLAlchemy to Import Data to Pandas

February 24, 2019 (updated April 4, 2019) by justinmatters

Sometimes you may want to use Python to extract data from a SQL database to analyse using pandas. There are a couple of issues here. Firstly…

Data Science

Kaggle PUBG Competition Data Analysis

November 19, 2018 (updated December 6, 2018) by justinmatters

Currently there is a fun competition running over on the Kaggle Data Science website. The objective is to use metrics from a large data set…

Data Science

Edinburgh Bike Open Data – 2 of 4 – Data Cleaning

October 11, 2018 (updated December 6, 2018) by justinmatters

Now that we have obtained our dataset from the Edinburgh Open Data store, we need to tidy it up and see if we need to transform…