Tag: Databricks

When you ask Databricks to write a JSON file, you may be surprised by the results. Rather than writing a single, standalone JSON file, instead…
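The post above refers to Spark writing a directory of `part-*` files rather than one JSON file. As a minimal illustration of the usual workaround, here is a plain-Python sketch (the `merge_part_files` helper is hypothetical, not from the post) that stitches Spark-style part files into a single JSON-lines file:

```python
import json
import tempfile
from pathlib import Path

def merge_part_files(output_dir: str, merged_path: str) -> int:
    """Concatenate Spark-style part-* files into one JSON-lines file.

    Hypothetical helper: df.write.json() produces a directory of
    part-* files (plus marker files like _SUCCESS) rather than a
    single file; this stitches the parts together and returns the
    number of lines written.
    """
    count = 0
    with open(merged_path, "w") as out:
        for part in sorted(Path(output_dir).glob("part-*")):
            for line in part.read_text().splitlines():
                if line.strip():
                    out.write(line + "\n")
                    count += 1
    return count

# Simulate the directory layout Spark produces, then merge it.
with tempfile.TemporaryDirectory() as d:
    Path(d, "part-00000").write_text('{"a": 1}\n')
    Path(d, "part-00001").write_text('{"a": 2}\n')
    Path(d, "_SUCCESS").write_text("")  # marker file; skipped by the glob
    merged = str(Path(d, "merged.json"))
    n = merge_part_files(d, merged)
    records = [json.loads(line) for line in Path(merged).read_text().splitlines()]
```

On a real cluster you would do this against cloud storage rather than a local temp directory, but the stitching logic is the same.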
The updates to graphing with PySpark on Databricks have made it much nicer to work with. Options exist to aggregate data directly in the graph, handle…
Recently Databricks made an exciting announcement. They have created a new library which allows you to use large language models to perform operations on PySpark…
A couple of quick, handy hints people may find useful. Checking whether a Delta file exists: the Delta format is an excellent way to store…
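One common way to check whether a path holds a Delta table is to look for the `_delta_log` directory that stores the transaction log. This is a local-filesystem sketch of that idea (the `is_delta_table` helper is an assumption for illustration); on Databricks you would typically check a cloud path or use `DeltaTable.isDeltaTable(spark, path)` from the delta-spark package instead:

```python
import tempfile
from pathlib import Path

def is_delta_table(path: str) -> bool:
    """Rough check for a Delta table: a Delta table's directory
    contains a _delta_log subdirectory holding the transaction log.

    Sketch for local paths only; on a cluster, use the Delta Lake
    API or list the cloud path rather than pathlib.
    """
    return (Path(path) / "_delta_log").is_dir()

# Build one fake Delta table and one plain directory to compare.
with tempfile.TemporaryDirectory() as d:
    plain = Path(d, "plain_parquet")
    plain.mkdir()
    delta = Path(d, "delta_table")
    (delta / "_delta_log").mkdir(parents=True)
    results = (is_delta_table(str(delta)), is_delta_table(str(plain)))
```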
Great Expectations (GE) is a pipeline testing tool for use with Pandas and Spark dataframes. It is useful for checking that data flowing through your pipeline…
Previously I have scheduled PySpark jobs using Airflow, Papermill and the Databricks Jobs API. However, Databricks has an automation system built in for running automated…
Recently I needed to plot some geographic data I had been working on in PySpark on Databricks. I did a bit of research and found…
PySpark window functions are useful when you want to examine relationships within groups of data, rather than between groups of data as with groupBy. To…
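To make the within-groups idea concrete without needing a Spark cluster, here is a plain-Python analogue of a windowed rank (in PySpark this would be `F.rank().over(Window.partitionBy("dept").orderBy(F.desc("salary")))`): each row keeps its identity but gains a rank relative to the other rows in its partition, unlike a groupBy aggregation, which collapses each group to one row. The data below is invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Rank rows within each partition (dept) by salary, descending,
# without collapsing the partitions.
rows = [
    {"dept": "eng", "name": "ada", "salary": 120},
    {"dept": "eng", "name": "bob", "salary": 100},
    {"dept": "ops", "name": "cat", "salary": 90},
]

ranked = []
rows_sorted = sorted(rows, key=lambda r: (r["dept"], -r["salary"]))
for dept, group in groupby(rows_sorted, key=itemgetter("dept")):
    for rank, row in enumerate(group, start=1):
        ranked.append({**row, "rank": rank})
```

Every input row survives with a `rank` column attached: the two `eng` rows rank 1 and 2, and the ranking restarts at 1 for `ops`.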
Recently I needed to check for equality between PySpark dataframes as part of a test suite. To my surprise I discovered that there is no…
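A common workaround for the missing equality check (recent PySpark versions do ship `pyspark.testing.assertDataFrameEqual`, but older ones did not) is to collect both dataframes and compare sorted copies of the rows, since row order is not guaranteed. A minimal sketch of that comparison, with the `rows_equal` helper and the sample rows being illustrative assumptions:

```python
def rows_equal(rows_a, rows_b) -> bool:
    """Order-insensitive equality on collected rows.

    Sketch of the usual workaround: df.collect() yields Row objects,
    which compare like tuples, so comparing sorted copies checks the
    contents while ignoring row order.
    """
    return sorted(rows_a) == sorted(rows_b)

a = [(1, "x"), (2, "y")]
b = [(2, "y"), (1, "x")]   # same rows, different order
c = [(1, "x"), (3, "z")]   # different contents

same = rows_equal(a, b)
diff = rows_equal(a, c)
```

Collecting only scales to small test fixtures; for large dataframes, a `subtract` in both directions plus a count comparison is the usual cluster-side alternative.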
I gave a flash talk at Edinburgh PyData yesterday. It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform.…