Useful Things to Know when Starting with PySpark and Databricks

Databricks is a very handy cloud platform for large-scale data processing and machine learning using Spark. However, it does have some idiosyncrasies. Here are ten hints and tips for getting started, assuming you are programming in Python using PySpark (you can also use Scala, Java or R with Spark).

1 – The free trial of Databricks won’t run for free on AWS.

The information on free trials at https://databricks.com/try-databricks says that the trial is free, but notes that this excludes cloud charges. Unfortunately, as set up, Databricks requires at least 2 driver and 4 worker cores, and the small free-tier AWS t2.micro instances are not available as an option.

[Screenshot from the Databricks cluster management page, illustrating why a bare-bones free AWS instance won’t allow you to run Databricks.]

You could instead try the Azure free trial of Databricks at https://docs.azuredatabricks.net/getting-started/index.html to see if this works better. (Note: I could not verify this, as I was not eligible for a free trial account on Azure.)

The good news is that there is a way to try Databricks for free: the Databricks Community Edition. Details can be found here and the signup is here.

2 – When searching the documentation, always check that it refers to the correct version

It’s easy, particularly when using search engines, to be led to documentation for an older version of PySpark. Be sure to check the version number that appears in the documentation URLs, e.g. https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html. Fortunately, the documentation normally follows a coherent naming pattern, so if you get directed to the wrong page you can simply amend the version number in the URL and reload the page. As of this article (May 2019), the latest version is 2.4.3.
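
If you are not sure which version your cluster is actually running, you can check from within a notebook. This is a minimal sketch; it assumes the spark session object that Databricks notebooks pre-define.


# print the Spark version of the attached cluster, so you know
# which version of the documentation to read
# (spark is pre-defined in Databricks notebooks)
print(spark.version)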

3 – You need to check the language of examples you find on the web

When checking Stack Overflow or other help sites, be sure to specify PySpark if you want Python examples, and check that you are not looking at Spark code in another language. Scala in particular can look superficially similar but can lead you astray. That said, if you can’t find a Python example, it sometimes helps to look at examples in other languages to get clues as to how to do something. In particular, the Scala examples in the official documentation sometimes have some interesting additions compared to the Python examples.

4 – PySpark has two similarly named but incompatible machine learning libraries

MLlib is Spark’s old machine learning library built on top of the Resilient Distributed Dataset (RDD) abstraction. The documentation for this is at https://spark.apache.org/docs/2.4.3/api/python/pyspark.mllib.html

ML, on the other hand, is the newer machine learning library, which takes advantage of Spark’s DataFrames. The two libraries cannot be used interchangeably, and ML is generally the better choice in most cases. The documentation for this is at https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html
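
The difference shows up as soon as you write the imports. The sketch below contrasts the two libraries using linear regression; the class names are real, but the choice of linear regression is just for illustration.


# the old RDD-based library lives under pyspark.mllib
from pyspark.mllib.regression import LinearRegressionWithSGD

# the newer DataFrame-based library lives under pyspark.ml
from pyspark.ml.regression import LinearRegression

# the mllib model trains on an RDD of LabeledPoint objects, while the
# ml estimator fits a DataFrame with a features column, so objects from
# one library cannot be dropped into a pipeline built with the other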

5 – Lazy execution can lead to unusual effects

When you put operations into a notebook cell in Databricks and hit Shift-Enter, PySpark adds them to its execution plan, but they do not actually run until an output is required. This allows PySpark to optimise and parallelise your workloads. However, it can be a little surprising at first when a cell containing a machine learning pipeline or a complex SQL query appears to run almost instantly, but the following display command takes ages. In fact, it’s just that the operations in the previous cell did not fully execute until they were needed. This also means that if a cell fails, you may need to check back through previous cells to find the error.
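
Here is a small illustration of the effect. It is a sketch that assumes a DataFrame called df with a numeric price column; the names are made up for the example.


# a transformation: this returns almost instantly because nothing is computed yet
filtered = df.filter(df.price > 100)

# an action: this is where the filter (and any earlier lazy work) actually runs,
# so this is the line that appears slow
print(filtered.count())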

6 – PySpark does not have consistent syntax

Don’t be surprised if you find yourself spending a lot of time looking up syntax in the early days. Part of the issue is that syntax can differ in unpredictable ways between commands. As an example:


# When writing a CSV, overwriting is an optional parameter

df.repartition(1).write.csv(df_path, mode='overwrite', header='true')

# While for a machine learning model overwriting is a method call

ml_model.write().overwrite().save(ml_path)

7 – PySpark is not Pythonic

Spark is actually written in Scala, so you may sometimes have to do things that seem a little unusual in Python. The main one is that you will find yourself needing to declare types when performing certain operations in PySpark. There are also style differences, such as camel-cased method names, and conventions like square-bracket indexing do not work in all the situations you might expect.
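
For example, building a DataFrame from plain Python data usually means spelling out a schema with explicit types, and the method names are camel-cased. This is a minimal sketch; the column names and data are made up for illustration.


from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# explicit type declarations, Scala-style, rather than relying on inference
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True),
])

# note the camel-cased method name createDataFrame
people = spark.createDataFrame([('Alice', 34), ('Bob', 45)], schema)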

8 – PySpark DataFrames are not the same as pandas DataFrames

PySpark DataFrames do have many similarities to pandas DataFrames, and you can reason about them in a similar way. However, they also have differences. For a start, PySpark DataFrames do not have indexes, and conventional Pythonic slicing does not work on them the way it does in pandas.

PySpark DataFrames are also immutable, so if you wish to add a new column you need to use the .withColumn method to create a new DataFrame.


# an example of creating a new column in a DataFrame

df = df.withColumn('col_incremented', df.col + 1)

In fact, few commands are exactly the same as their pandas equivalents. To help with this, I have made a list of basic commands and their pandas equivalents. Hopefully you will find it useful.
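
To give a rough flavour of the differences, the sketch below compares a couple of common operations, assuming a PySpark DataFrame df with columns a and b (the names are made up for the example).


# pandas: subset = df[df['a'] > 1][['a', 'b']]
# PySpark: filter and select are methods, and nothing runs until an action
subset = df.filter(df['a'] > 1).select('a', 'b')

# pandas: n = len(df)
# PySpark: count() is an action that triggers execution
n = subset.count()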

9 – User Defined Functions are useful but need to be written in a particular way

Many common functions are available in PySpark. However, sometimes you will need to define a function of your own, and when you do you’ll need to know how to use UDFs (User Defined Functions) in PySpark. Working out how these should be constructed can be tricky. Here is a general pattern you can use to create them (in some cases you can abbreviate this pattern in various ways once you become more familiar with UDFs).


# many useful functions in pyspark.sql make this import almost standard
from pyspark.sql import functions as fn

# type import since we need to declare a type for the UDF
from pyspark.sql.types import StringType

# an example of a function to be incorporated into a UDF
def get_region(country, region, city):
    if country != "United Kingdom":
        return '(Outside of UK)'
    if region == 'England':
        if city == 'London':
            return 'London'
        else:
            return 'England (excl. London)'
    return region

# incorporate the function into a UDF
# note the type declaration in the UDF definition
# simple functions could be incorporated directly into the lambda
udf_region = fn.udf(lambda x, y, z: get_region(x, y, z), StringType())

# use .withColumn and UDF to add a new column to the DataFrame
df = df.withColumn('location', udf_region('country', 'region', 'city'))

Later on you may wish to get more sophisticated with UDFs; one example is incorporating Scala UDFs into Python code, as shown here.

10 – There are a few free books available online to get you started

Due to the speed with which Spark is being developed, these will likely become dated quite rapidly, but as of May 2019 they provide a good starting point for those wishing to learn more about PySpark.

Getting Started with Apache Spark – A good general overview of the Spark ecosystem, with links to the documentation. Unfortunately the advertised downloadable PDF no longer appears to be available.

Learning Apache Spark with Python – Probably the most comprehensive free PySpark-specific ebook and PDF available, though a little dated in parts.

Learning Apache-Spark – A free ebook and PDF compiled from Stack Overflow contributions. It is far from exhaustive, but it covers a few common issues.

As always, the folks over at DataCamp have a cheat sheet available.
