The updates to graphing with PySpark on Databricks have made it much nicer to work with. Options exist to aggregate data directly in the graph, handle…
Category: Data Science
With the rise of Large Language Models, projects have arisen to try to make LLMs easier to use. One of the most prominent of these…
Recently, Databricks made an exciting announcement. They have created a new library which allows you to use large language models to perform operations on PySpark…
Recently I noticed that the ArrayType in PySpark is missing some useful aggregation functions. Let's suppose you have a DataFrame created as follows: If…
A key concern when designing Machine Learning models is to try to avoid bias which might lead to unfair models. When thinking about how things…
When I moved from using mostly AWS to using mostly Azure at work, I did not find the transition too painful. However, while I mostly…
Exploding arrays is often very useful in PySpark. However, because row order is not guaranteed in PySpark DataFrames, it would be extremely useful to be…
Sometimes it is useful not only to compute aggregate functions, but also to be able to compare them on a row-by-row basis with…
If you use PySpark, you are likely aware that, as well as being able to group by and count elements, you are also able to group…
The usual approach to prediction problems these days is to create a machine learning model. However, machine learning models can struggle to train on sparse…