Data Scotland 2024

I recently attended Data Scotland 2024 and saw a number of interesting talks. Here are some takeaways. Where possible I have linked to the talk slides and to the presenter’s web presence. I can’t say how long these links will remain available, but there is always the Wayback Machine.

Introduction to Performance Tuning on Databricks – Niall Langley

A nice talk discussing some of the important considerations when performance tuning on Databricks. The talk covered Spark’s memory management; narrow vs wide (or shuffling) transformations; data skew; jobs, stages and tasks; caching and coalescing; explicit and implicit broadcast joins; the importance of filtering when performing delta merges; and the inspection tools available in Databricks for investigating performance issues. I was aware of most of these considerations, but this talk brought things together in a clear and concise manner. Unfortunately the slides from the presentation are not available from the Data Scotland website, but given the list of optimisation approaches, you can find useful articles covering many of the approaches online such as this one and this one.
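To make the difference between a narrow transform, a shuffle and data skew a bit more concrete, here is a toy sketch in plain Python. It is my own illustration rather than anything shown in the talk, and it does not use Spark at all: the partitioner, the function names and the "hot" key data are all made up for the example.

```python
from zlib import crc32


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32, because Python's hash() is salted per run)."""
    return crc32(key.encode("utf-8")) % num_partitions


def narrow_map(partitions, fn):
    """Narrow transform: each output partition depends on exactly one input partition."""
    return [[fn(record) for record in part] for part in partitions]


def shuffle_by_key(partitions, num_partitions):
    """Shuffling transform: records move between partitions so equal keys end up together."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            out[partition_of(key, num_partitions)].append((key, value))
    return out


# Hypothetical input: 4 evenly sized partitions, but 90% of the rows share one key.
rows = [("hot", i) for i in range(900)] + [(f"k{i}", i) for i in range(100)]
partitions = [rows[i::4] for i in range(4)]

doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
shuffled = shuffle_by_key(doubled, 4)

sizes_before = [len(p) for p in doubled]
sizes_after = [len(p) for p in shuffled]
print("partition sizes after narrow map:", sizes_before)
print("partition sizes after shuffle:   ", sizes_after)
```

The narrow map leaves the partitions evenly sized, while the shuffle puts every "hot" row into a single partition. In a real cluster, the task that handles that partition would be the straggler.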

AI at Work – Chrissy LeMaire

This talk covered a lot of use cases for LLMs, some obvious, some more surprising, and included a review of currently available LLM technologies. A worthwhile talk, focussed mainly on using AI to help with your day-to-day work rather than on integrating AI into pipelines or products.

Your first Microsoft Fabric Lakehouse implementation – Bas Land

A solid introduction to building a basic setup in Fabric. I had been interested in this as a possible competitor to other OLAP serving solutions like Synapse, Databricks SQL and Apache Druid, since we need an improved OLAP capability at work at QueryClick. It certainly looks like it will be a powerful solution in time. However, speaking to other attendees and vendors around the conference, the general feeling appears to be that Fabric is not ready for prime time just yet. For now I think we will continue to pursue Druid as the most cost-effective solution.

20 PowerBI Tips in 20 Minutes – Matt Lakin

A nice variety of PowerBI tips and tricks, some well known and some less so, with plenty of good information.

Let’s Add AI Assistant to your Local SQL Server in 20 minutes – Sergey Olontsev

This was a neat little tech demo using LLMs to combine the schema of your SQL database and a natural language query to give you the SQL to run that natural language query against the database. Potentially a useful way to let less technical users query a SQL database, but with caveats. Firstly, you will need to be very careful to guard against hazards like SQL injection or data corruption by sanitising the LLM output before it is run. Secondly, there is a danger that the LLM might create a valid but incorrect query which returns apparently valid data which is not the data the user desires. This sort of thing is still in its infancy, but shows promise. Unfortunately the demo does not appear to be available online, but there are articles from other authors discussing similar approaches.
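As a rough idea of what guarding the generated SQL could look like, here is a minimal sketch of my own. It is not the code from the demo, and it uses Python's built-in sqlite3 as a stand-in for SQL Server so that it runs anywhere. The function name run_generated_query and the max_rows limit are assumptions for illustration.

```python
import sqlite3


def run_generated_query(db_path: str, generated_sql: str, max_rows: int = 100):
    """Run LLM-generated SQL defensively: one SELECT only, on a read-only connection."""
    statement = generated_sql.strip().rstrip(";").strip()
    # Deliberately conservative: also rejects a ';' inside a string literal.
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not statement.lower().startswith("select"):
        raise ValueError("only SELECT queries are allowed")
    # mode=ro makes SQLite itself refuse writes, even if the checks above are fooled.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(statement).fetchmany(max_rows)
    finally:
        conn.close()
```

The text checks are only a first line of defence. The read-only connection is what actually protects the data, and on a server database I would expect the equivalent to be a login with read-only permissions. None of this helps with the second caveat, a valid query that answers the wrong question.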

Quest to Delta Optimisation – Falek Miah

A useful talk about optimisation techniques when working with delta files. It went through the use of vacuum to remove obsolete files. It also covered z-ordering and optimise operations, which compact fragmented files and sort the data into an optimal retrieval order. Finally, the talk discussed liquid clustering, which arrived with Databricks Runtime 13.3 LTS. Liquid clustering is the new default, and its advantages and disadvantages were explained.
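For reference, the compaction, z-ordering and vacuum steps map onto a couple of Delta SQL statements. The sketch below is my own and was not shown in the talk. It only builds the statements as strings so that it runs without a cluster; the helper name, the table name and the column names are hypothetical. On Databricks each statement would be passed to spark.sql.

```python
def delta_maintenance_sql(table: str, zorder_cols: list[str], retain_hours: int = 168) -> list[str]:
    """Build Delta maintenance statements. Identifiers are assumed to be trusted input."""
    # OPTIMIZE compacts small files; ZORDER BY co-locates rows with similar column values.
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += f" ZORDER BY ({', '.join(zorder_cols)})"
    # VACUUM deletes files no longer referenced by the table and older than the retention window.
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return [optimize, vacuum]


# Hypothetical table and columns, for illustration only.
for stmt in delta_maintenance_sql("sales.events", ["customer_id", "event_date"]):
    print(stmt)  # on Databricks: spark.sql(stmt)
```

The 168 hour retention is seven days, which is Delta's default retention period. Z-ordering does not apply to tables that use liquid clustering, so pass an empty column list for those.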

As an aside, Falek has written a useful blog about Databricks Unity Catalog.

Ingesting and analyzing millions of events per second in real-time using open source tools – Javier Ramirez

This was a seriously impressive tech demo of how to use open source tools to ingest real-time data at mind-blowing event rates on comparatively modest hardware. Initial ingestion was via Apache Kafka to Avro format. This was followed by using Kafka Connect to perform basic transforms and then pipe to a QuestDB database. As if that was not enough, the data was then presented as dashboards in Grafana, monitored using Telegraf and analysed using Jupyter notebooks. All super impressive. The code for the dockerised demo setup is available on Javier’s GitHub.

Overall

Overall the quality of talks at Data Scotland 2024 was very high. Perhaps the most useful aspect was the chance to get introductions to unfamiliar technologies from knowledgeable presenters. This helped me to judge whether each technology was worth investigating further. With attendance being free, covered by the generous event sponsors, there is really no reason not to attend Data Scotland events.