Data Scotland 2025

How time flies: Data Scotland 2025 was another great event. Here are the things I learned about this year. Where possible I have linked to each presenter’s web presence and to web links relevant to the topics they discussed. I can’t say how long these links will stay available, but there is always the Wayback Machine. Sadly, we have learned that the next Data Scotland may not be until 2027; it will be missed.

Smashing the Bottleneck: Scaling Data Accessibility for an Organisation – Jamie McLaughlin

A high-level talk about the barriers to data accessibility and ways to address both the human and the technical issues surrounding data sharing within an organisation. As part of the talk, Jamie gave an overview of the features of Microsoft Fabric that he believes make it a strong platform for data sharing.

Jamie discussed governance via Entra and Purview. He also looked at features like data dictionaries, low-code canvases and layout view. Also highlighted was report endorsement, which not only allows for something similar to code review for reports and visualisations, but can also be used to encourage and assist end users to self-serve the reports they need. Best of all, those reports can then be made available to others in the organisation.

Unlocking the Potential of Retrieval-Augmented Generation (RAG) with Advanced Patterns – Tori Tompkins

An excellent guide to the current state of the art in Retrieval Augmented Generation. The talk emphasised its costs and benefits compared to unaugmented LLM systems and to fine-tuning or training custom LLMs. Tori covered a number of approaches to RAG systems, starting with Naive RAG and moving on to RAG Fusion, Corrective RAG, REFRAG and Graph RAG. She also highlighted simple methods to get the most out of a RAG system, like spell-correcting initial inputs and breaking queries into components for retrieval to get better responses.
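
To give a flavour of these patterns, below is a minimal sketch of the reciprocal rank fusion step at the heart of RAG Fusion: an LLM generates several variants of the user’s query, each variant is run against the retriever, and the ranked result lists are merged so that documents ranked highly in several lists float to the top. The query variants and the commented-out retrieve() call are hypothetical stand-ins, not code from the talk.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document ids using RRF.

    Each document scores 1 / (k + rank) per list it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: retrieve() would be your vector-store search, and the
# variants would come from an LLM rewriting the original user query.
variants = [
    "how do I reset my password",
    "password reset procedure",
    "recover account access",
]
# ranked_lists = [retrieve(q) for q in variants]
ranked_lists = [["doc3", "doc1"], ["doc1", "doc2"], ["doc1", "doc3"]]
print(reciprocal_rank_fusion(ranked_lists))  # doc1 comes out on top
```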

Tori also delved into methods for developing RAG-LLM systems, including using LangGraph for tracing, and using MLflow or Ragas to track different experiments in system tuning, such as different chunking or embedding strategies. The presentation also suggested some methods for testing output performance, such as similarity metrics across repetitions of the same query, and the BLEU and ROUGE NLP metrics. An excellent and engaging talk with a lot of food for thought. Tori also has a blog.
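
As an illustration of those testing ideas, here is a minimal sketch of a repeat-query consistency check plus ROUGE scoring. It assumes the sentence-transformers and rouge-score packages, and the commented-out ask_rag() call is a hypothetical stand-in for your own pipeline.

```python
# pip install sentence-transformers rouge-score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

def consistency(answers, model):
    """Mean pairwise cosine similarity between repeated answers to one query."""
    emb = model.encode(answers, convert_to_tensor=True)
    sims = util.cos_sim(emb, emb)
    n = len(answers)
    return float((sims.sum() - n) / (n * (n - 1)))  # exclude the diagonal 1s

model = SentenceTransformer("all-MiniLM-L6-v2")
# answers = [ask_rag("What is our refund policy?") for _ in range(5)]  # hypothetical
answers = ["Refunds are available within 30 days.",
           "You can get a refund inside 30 days of purchase."]
print("consistency:", consistency(answers, model))

# ROUGE against a reference, where a gold answer exists for the query
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score("Refunds are offered within 30 days.", answers[0]))
```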

My Real-Time Intelligence Journey: From Power BI Data Visualization to Streaming Data – Valerie Junk

An entertaining talk about some of the perils and pitfalls you can face when trying to integrate real-time data with Power BI. Valerie explained why real-time dashboards may need to make some compromises on the data presented, and why an approach like Fabric Real-Time Dashboards and KQL (Kusto Query Language) may serve better than Power BI and DAX.

Valerie introduced a few useful pointers for those starting with real-time dashboards, such as training available from Microsoft, Fabric’s three built-in streaming data sources, and a neat gamified tutorial for gaining experience with KQL. Also highlighted was the differing use case for real-time data compared to more conventional reports, and the importance of contextual highlighting to help users identify anomalies.

Crafting LLM Applications: Design, Build, and Evaluate – Shubhangi Goyal

A quick overview of some considerations when designing LLM-based systems, including the reasons to use an LLM rather than another method, and how to improve responses with prompt engineering techniques such as few-shot examples, chain-of-thought reasoning, RAG, and CAG. Also mentioned was the ability to use Azure Prompt Flow to help with model evaluation. Quite a lot packed into a short presentation slot.
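
As a flavour of two of those techniques, here is a minimal sketch of assembling a few-shot, chain-of-thought prompt in Python. The worked examples and the call_llm() function are hypothetical illustrations, not material from the talk.

```python
# Hypothetical few-shot examples showing the reasoning format we want back.
FEW_SHOT = [
    {"q": "A jacket costs £40 and is 25% off. Final price?",
     "steps": "25% of 40 is 10, so the price is 40 - 10.",
     "a": "£30"},
    {"q": "A £60 item has 10% tax added. Total?",
     "steps": "10% of 60 is 6, so the total is 60 + 6.",
     "a": "£66"},
]

def build_prompt(question: str) -> str:
    parts = ["Answer the question. Think step by step before answering.\n"]
    for ex in FEW_SHOT:  # few-shot examples demonstrating the format
        parts.append(f"Q: {ex['q']}\nReasoning: {ex['steps']}\nA: {ex['a']}\n")
    parts.append(f"Q: {question}\nReasoning:")  # elicit chain-of-thought
    return "\n".join(parts)

print(build_prompt("A £25 book is 20% off. Final price?"))
# response = call_llm(build_prompt(...))  # hypothetical LLM call
```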

No-Compromise Data Apps: Why Streamlit is the Missing Piece in Your Analytics Stack – Barry Smart

Barry gave a fascinating talk on using Streamlit for rapid prototyping of everything from data visualisation, to data validation, to data modelling, and even front ends for LLMs (as I have previously detailed myself). This included a number of really impressive fast demos, using uv for Python environments and VS Code with Demo Time to run them, showing vividly how quickly you can prototype with Streamlit.

The dataframe-handling demonstration showed the use of st.dataframe, st.data_editor and st.plotly_chart. Data validation made use of st.file_uploader and Pandera for data checks. Also demonstrated were @st.cache_data and st.session_state, used to carry data across the re-runs a Streamlit app performs whenever its inputs change. Presentation options such as st.columns, st.spinner and st.expander were shown too. Combine this with converting a Jupyter notebook into a Streamlit app and it was quite the whirlwind tour!
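
As a taste of how little code such an app needs, here is a minimal sketch combining several of the pieces above. It is an illustration under my own assumptions, not Barry’s demo code.

```python
# Run with: streamlit run app.py   (assumes streamlit and pandas installed)
import pandas as pd
import streamlit as st

@st.cache_data  # only re-reads the file when the uploaded bytes change
def load_csv(uploaded) -> pd.DataFrame:
    return pd.read_csv(uploaded)

st.title("Quick data explorer")

uploaded = st.file_uploader("Upload a CSV", type="csv")
if uploaded is not None:
    with st.spinner("Loading..."):
        df = load_csv(uploaded)

    left, right = st.columns(2)
    with left:
        st.dataframe(df)             # read-only grid
    with right:
        edited = st.data_editor(df)  # editable grid returns the edited frame

    with st.expander("Summary statistics"):
        st.dataframe(edited.describe())

# st.session_state persists values across the re-runs each interaction triggers
if st.button("Count me"):
    st.session_state["clicks"] = st.session_state.get("clicks", 0) + 1
st.write("Clicks this session:", st.session_state.get("clicks", 0))
```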

Luckily Barry and the other folk at Endjin have an online article on Streamlit to delve into at leisure, and another one about how to containerise it.

Your first SQL Database implementation in Microsoft Fabric – Kamil Nowinski

This presentation focussed on how moving a SQL database into Azure can allow for a more flexible “translytical” approach to your data. The public preview of SQL database in Microsoft Fabric was demonstrated, as was syncing the database schema to Azure DevOps or GitHub using DACPAC. The presentation was thus largely a tech demo, but an interesting one, and the inclusion of SQL databases in Fabric will doubtless be a welcome addition. The slides that accompany the demo can be found on Kamil’s GitHub.
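
Part of the appeal is that a Fabric SQL database presents an ordinary SQL Server endpoint. Purely as an illustration, a minimal sketch of querying one from Python with pyodbc might look like the following; the server and database values are placeholders you would copy from the connection strings shown in the Fabric portal, and Entra interactive sign-in is assumed.

```python
# Illustrative sketch: querying a Fabric SQL database like any SQL Server.
# Assumes pyodbc and ODBC Driver 18; <server> and <database> are placeholders
# copied from the connection strings exposed in the Fabric portal.
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<server>;"
    "Database=<database>;"
    "Authentication=ActiveDirectoryInteractive;"  # Entra ID sign-in
    "Encrypt=yes;"
)

with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT TOP 5 name, create_date FROM sys.tables;")
    for row in cursor.fetchall():
        print(row.name, row.create_date)
```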

Apache Flink 101: Build a Real-time Data Stream Processing Pipeline with Apache Flink – Daren Wong

An interesting talk about Apache Flink and the niche it fills: lightweight real-time processing is provided by Apache Kafka Streams, heavyweight real-time processing like Spark Structured Streaming sits at the other end, and Flink sits in between. Alongside a use-case comparison, Daren included a couple of demos of Flink’s capabilities, such as anomaly detection, window averages, data joins, and count aggregation.
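
To give a flavour of what such a job looks like, here is a minimal sketch of a windowed count in PyFlink, assuming the apache-flink package is installed. The sensor events are hypothetical stand-ins; a real job would read from an unbounded source such as Kafka, since processing-time windows over a tiny bounded collection may close before they ever fire.

```python
from pyflink.common.time import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingProcessingTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

events = env.from_collection(
    [("sensor-1", 1), ("sensor-2", 1), ("sensor-1", 1)],  # hypothetical events
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]),
)

counts = (
    events
    .key_by(lambda e: e[0])  # partition the stream by sensor id
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .reduce(lambda a, b: (a[0], a[1] + b[1]))  # sum counts within each window
)

counts.print()
env.execute("windowed_count_sketch")
```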

Overall I got the impression that Flink serves a fairly specific role: its sources and sinks overlap heavily with those for Kafka and Spark, so many applications will be better served by Kafka Streams or Spark, but it is always interesting to gain awareness of alternatives.
