Tag: Databricks

When you ask Databricks to write a JSON file, you may be surprised by the results. Rather than writing a single, standalone JSON file, instead…
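The post above refers to Spark writing a directory of `part-*` files rather than one JSON file. As a minimal illustration of the usual workaround, here is a plain-Python sketch (the `merge_part_files` helper is hypothetical, not from the post) that stitches Spark-style part files into a single JSON-lines file:

```python
import json
import tempfile
from pathlib import Path

def merge_part_files(output_dir: str, merged_path: str) -> int:
    """Concatenate Spark-style part-* files into one JSON-lines file.

    Hypothetical helper: df.write.json() produces a directory of
    part-* files (plus marker files like _SUCCESS) rather than a
    single file; this stitches the parts together and returns the
    number of lines written.
    """
    count = 0
    with open(merged_path, "w") as out:
        for part in sorted(Path(output_dir).glob("part-*")):
            for line in part.read_text().splitlines():
                if line.strip():
                    out.write(line + "\n")
                    count += 1
    return count

# Simulate the directory layout Spark produces, then merge it.
with tempfile.TemporaryDirectory() as d:
    Path(d, "part-00000").write_text('{"a": 1}\n')
    Path(d, "part-00001").write_text('{"a": 2}\n')
    Path(d, "_SUCCESS").write_text("")  # marker file; skipped by the glob
    merged = str(Path(d, "merged.json"))
    n = merge_part_files(d, merged)
    records = [json.loads(line) for line in Path(merged).read_text().splitlines()]
```

On a real cluster you would do this against cloud storage rather than a local temp directory, but the stitching logic is the same.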
The updates to graphing with PySpark on Databricks have made it much nicer to work with. Options exist to aggregate data directly in the graph, handle…
Recently Databricks made an exciting announcement. They have created a new library which allows you to use large language models to perform operations on PySpark…
A couple of quick, handy hints people may find useful. Checking whether a Delta file exists: the Delta format is an excellent way to store…
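One common way to check whether a path holds a Delta table is to look for the `_delta_log` directory that stores the transaction log. This is a local-filesystem sketch of that idea (the `is_delta_table` helper is an assumption for illustration); on Databricks you would typically check a cloud path or use `DeltaTable.isDeltaTable(spark, path)` from the delta-spark package instead:

```python
import tempfile
from pathlib import Path

def is_delta_table(path: str) -> bool:
    """Rough check for a Delta table: a Delta table's directory
    contains a _delta_log subdirectory holding the transaction log.

    Sketch for local paths only; on a cluster, use the Delta Lake
    API or list the cloud path rather than pathlib.
    """
    return (Path(path) / "_delta_log").is_dir()

# Build one fake Delta table and one plain directory to compare.
with tempfile.TemporaryDirectory() as d:
    plain = Path(d, "plain_parquet")
    plain.mkdir()
    delta = Path(d, "delta_table")
    (delta / "_delta_log").mkdir(parents=True)
    results = (is_delta_table(str(delta)), is_delta_table(str(plain)))
```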
Great Expectations (GE) is a pipeline testing tool for use with Pandas and Spark dataframes. It is useful for checking that data flowing through your pipeline…
Previously I have scheduled PySpark jobs using Airflow, Papermill and the Databricks Jobs API. However, Databricks has an automation system built in for running automated…
Recently I needed to plot some geographic data I had been working on in PySpark on Databricks. I did a bit of research and found…
PySpark window functions are useful when you want to examine relationships within groups of data, rather than between groups of data as with groupBy. To…
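To make the within-groups idea concrete without needing a Spark cluster, here is a plain-Python analogue of a windowed rank (in PySpark this would be `F.rank().over(Window.partitionBy("dept").orderBy(F.desc("salary")))`): each row keeps its identity but gains a rank relative to the other rows in its partition, unlike a groupBy aggregation, which collapses each group to one row. The data below is invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Rank rows within each partition (dept) by salary, descending,
# without collapsing the partitions.
rows = [
    {"dept": "eng", "name": "ada", "salary": 120},
    {"dept": "eng", "name": "bob", "salary": 100},
    {"dept": "ops", "name": "cat", "salary": 90},
]

ranked = []
rows_sorted = sorted(rows, key=lambda r: (r["dept"], -r["salary"]))
for dept, group in groupby(rows_sorted, key=itemgetter("dept")):
    for rank, row in enumerate(group, start=1):
        ranked.append({**row, "rank": rank})
```

Every input row survives with a `rank` column attached: the two `eng` rows rank 1 and 2, and the ranking restarts at 1 for `ops`.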
Recently I needed to check for equality between PySpark dataframes as part of a test suite. To my surprise I discovered that there is no…
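A common workaround for the missing equality check (recent PySpark versions do ship `pyspark.testing.assertDataFrameEqual`, but older ones did not) is to collect both dataframes and compare sorted copies of the rows, since row order is not guaranteed. A minimal sketch of that comparison, with the `rows_equal` helper and the sample rows being illustrative assumptions:

```python
def rows_equal(rows_a, rows_b) -> bool:
    """Order-insensitive equality on collected rows.

    Sketch of the usual workaround: df.collect() yields Row objects,
    which compare like tuples, so comparing sorted copies checks the
    contents while ignoring row order.
    """
    return sorted(rows_a) == sorted(rows_b)

a = [(1, "x"), (2, "y")]
b = [(2, "y"), (1, "x")]   # same rows, different order
c = [(1, "x"), (3, "z")]   # different contents

same = rows_equal(a, b)
diff = rows_equal(a, c)
```

Collecting only scales to small test fixtures; for large dataframes, a `subtract` in both directions plus a count comparison is the usual cluster-side alternative.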
I gave a flash talk at Edinburgh PyData yesterday. It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform.…