I gave a flash talk at Edinburgh PyData yesterday. It covered the merits and pitfalls of PySpark and Databricks as a big data processing platform.…
Tag: Python
There are a variety of ways to filter strings in PySpark, each with its own advantages and disadvantages. This post will consider three of the…
Jupyter notebooks are great; however, the interface for file handling has its issues. One issue is that files have to be downloaded individually, there is…
PySpark is very powerful. However, because it is based on Scala, we need to be careful about types, as they are not Pythonic. And because…
If you work in data science you have probably come across the pipeline model for handling data transformations. It is used by many machine learning…
Databricks is a very handy cloud platform for large-scale data processing and machine learning using Spark. However, it does have some idiosyncrasies. Here are…
Occasionally you may want to invoke a stored procedure from your Python code in order to manipulate data as part of a larger task. Naively…
In the last blog post I discussed using SQLAlchemy to import SQL database data into pandas for data analysis. But what if you wish…
Sometimes you may want to use Python to extract data from a SQL database to analyse using pandas. There are a couple of issues here. Firstly…
One objection many people have to Jupyter Notebooks is the difficulty of producing clean code in them. Let's look at a few tools to help…