Using Shell Scripts with PySpark on Databricks

February 25, 2024 (updated March 4, 2024) by justinmatters

Sometimes a situation will crop up where you want to access functionality in Databricks that is not readily accessible via Python. In these cases, Databricks allows…

Data Science

Building Paths in PySpark

November 26, 2023 by justinmatters

I recently came across a strange little problem with a satisfying solution, which people building path-based models might be interested in. The Problem: Imagine…

Data Science

Graphing in Databricks

September 30, 2023 by justinmatters

The updates to graphing with PySpark on Databricks have made it much nicer to work with. Options exist to aggregate data directly in the graph, handle…

Artificial Intelligence

Is LangChain Useful?

July 2, 2023 (updated March 1, 2026) by justinmatters

With the rise of Large Language Models, projects have arisen to try to make LLMs easier to use. One of the most prominent of these…

Artificial Intelligence

Using AI on PySpark workloads with Databricks

June 30, 2023 (updated March 1, 2026) by justinmatters

Recently Databricks made an exciting announcement. They have created a new library which allows you to use large language models to perform operations on PySpark…

Data Science

Working Around Missing Array Functions in PySpark

April 29, 2023 (updated May 28, 2023) by justinmatters

Recently I noticed that the ArrayType in PySpark is missing some useful aggregation functions. Let's suppose you have a data frame created as follows: If…

Data Science

Avoiding Cognitive Bias

March 31, 2023 (updated April 2, 2023) by justinmatters

A key concern when designing Machine Learning models is to try to avoid bias which might lead to unfair models. When thinking about how things…

Data Science

Azure Fundamentals Exams

January 25, 2023 (updated April 2, 2023) by justinmatters

When I moved from using mostly AWS to using mostly Azure at work, I did not find the transition too painful. However, while I mostly…

Data Science

Retrieving the Index of PySpark Array Elements when Exploding

December 30, 2022 (updated January 8, 2023) by justinmatters

Exploding arrays is often very useful in PySpark. However, because row order is not guaranteed in PySpark DataFrames, it would be extremely useful to be…

Data Science

Combining Original Values with Global Aggregations in PySpark

November 26, 2022 (updated January 8, 2023) by justinmatters

Sometimes it is useful not only to compute aggregate functions, but also to be able to compare them on a row-by-row basis with…