Producing Heatmaps from PySpark

Recently I needed to plot some geographic data I had been working on in PySpark on Databricks. I did a bit of research and found a handy method buried inside some of Databricks' own training material. I thought I would briefly describe how to do it in the hope it might prove useful to others.

Heatmap of population in UK cities

For this demonstration I have created a workbook which pulls some open source data (not the data I was working on), including latitude and longitude, from simplemaps.com. Due to the way this was presented on the web it proved simpler to load the data via an intermediate Pandas dataframe.

from urllib.request import Request, urlopen
import pandas as pd

# formulate a request
req = Request("https://simplemaps.com/static/data/country-cities/gb/gb.csv")
req.add_header(
  'User-Agent', 
  'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'
)
content = urlopen(req)

# use the request to populate a dataframe and convert to a spark dataframe
pandas_df = pd.read_csv(content)
df = spark.createDataFrame(pandas_df)

# some towns in the data do not have a population stated
# let's assume they are smaller towns and fill population in as 5000
# (restricting the fill to that column leaves other numeric columns alone)
df = df.fillna(5000, subset=["population"])

We need to identify the latitude and longitude variables along with the variable we want to plot. We then extract those variables, converting them into a long JSON-formatted string representing the rows of a list of lists, which we can later insert into some HTML code. The population is divided by 1000 on the way out to scale the plotted values.

from pyspark.sql.functions import col

data = ",\n".join(
  map(
    lambda row: "[{}, {}, {}]".format(
      row[0], row[1], row[2]
    ), df.select(
      col("lat"),col("lng"),col("population")/1000
    ).collect()
  )
)
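As a hedged alternative to joining formatted strings by hand, the sketch below builds the same list of lists with Python's json module. The helper name rows_to_heat_json, its scale parameter and the sample rows are all hypothetical and not part of the original workbook; plain tuples stand in for the collected Spark rows, which can be indexed the same way.

```python
import json


def rows_to_heat_json(rows, scale=1.0):
    """Serialise (lat, lng, value) rows into a JSON list of lists.

    `rows` is any iterable of 3-item sequences, for example the result of
    df.select(...).collect(); `scale` multiplies the plotted value.
    """
    points = [[float(r[0]), float(r[1]), float(r[2]) * scale] for r in rows]
    return json.dumps(points)


# Made-up sample rows standing in for collected Spark rows
sample_rows = [(51.5, -0.1, 8000000), (53.5, -2.2, 2500000)]
heat_json = rows_to_heat_json(sample_rows, scale=1 / 1000)
print(heat_json)
```

Note that the output of json.dumps already includes the outer square brackets, so a string built this way would be passed to L.heatLayer directly rather than being wrapped in a further pair of brackets.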

To actually plot the data we will make use of the Leaflet JavaScript mapping library and a heatmap extension for it, Leaflet.heat. These are both open source, and can be pulled in from readily available sources on the web, though if you are going to be making a lot of plots you should host this code yourself for stability and out of politeness.

To render the HTML, Databricks (where I was working) has a handy displayHTML function. If you are using PySpark in a different environment you will need an equivalent way to render HTML, for example IPython.display.HTML in a Jupyter notebook.
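If no notebook renderer is available at all, one simple fallback is to write the page to a file and open it in a browser. The following is only a sketch of that idea: save_html is a hypothetical helper, and the placeholder page stands in for the full Leaflet HTML string.

```python
from pathlib import Path
import tempfile


def save_html(html, path):
    """Write an HTML string to disk and return the resolved path."""
    out = Path(path)
    out.write_text(html, encoding="utf-8")
    return out.resolve()


# Placeholder page; in practice this would be the full Leaflet HTML string
page = "<html><body><p>map goes here</p></body></html>"
target = Path(tempfile.gettempdir()) / "heatmap_example.html"
saved = save_html(page, target)
print(saved)
```

The saved file can then be opened by hand, or from Python with webbrowser.open(saved.as_uri()) on a machine that has a desktop browser.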

The code to actually do the plotting simply loads the relevant Leaflet code, tells it the location of the OpenStreetMap map tiles and then feeds it our data. The code looks like this:

displayHTML("""
<html>
<head>
 <link rel="stylesheet" href="https://unpkg.com/leaflet@1.3.1/dist/leaflet.css"
   integrity="sha512-Rksm5RenBEKSKFjgI3a41vrjkw4EVPlJ3+OiI65vTjIdo9brlAacEuKOiQ5OFh7cOI1bkDwLqdLw3Zg0cRJAAQ=="
   crossorigin=""/>
 <script src="https://unpkg.com/leaflet@1.3.1/dist/leaflet.js"
   integrity="sha512-/Nsx9X4HebavoBvEBuyp3I7od5tA0UzAxs+j83KgC8PU0kgB4XiK4Lfe4y4cgBtaRJQEIFCW+oC506aPT2L1zw=="
   crossorigin=""></script>
 <script src="https://cdnjs.cloudflare.com/ajax/libs/leaflet.heat/0.2.0/leaflet-heat.js"></script>
</head>
<body>
  <div id="uk_map_id" style="width:768px; height:1024px"></div>
  <script>
    var uk_map = L.map('uk_map_id').setView([55,-5], 6);
    var tiles = L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
      attribution: '© <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors',
    }).addTo(uk_map);
    var heat = L.heatLayer([""" + data + """], {radius: 30}).addTo(uk_map);
  </script>
</body>
</html>
""")

That should give you a map. My map is centred on the UK as I am using UK data; however, you can adjust the area and zoom by altering parameters in the HTML. For example, the line var uk_map = L.map('uk_map_id').setView([55,-5], 6); sets the latitude to 55 and the longitude to -5 with a zoom level of 6. You may also need to multiply the value you are plotting by a scaling factor in order to get a plot which you feel is informative (the example above divides population by 1000). This is best done at the point you extract the data from the PySpark dataframe.
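To make the view settings and the scaling factor easier to experiment with, the page can be generated by a small Python function instead of being edited by hand. This is a minimal sketch under my own assumptions: build_heatmap_html, PAGE_TEMPLATE and the __NAME__ placeholder tokens are hypothetical, the integrity attributes are left out for brevity, and the sample point is made up.

```python
import json

# Page template; the __NAME__ tokens are swapped in below so the braces
# in the JavaScript do not need escaping.
PAGE_TEMPLATE = """<html>
<head>
 <link rel="stylesheet" href="https://unpkg.com/leaflet@1.3.1/dist/leaflet.css"/>
 <script src="https://unpkg.com/leaflet@1.3.1/dist/leaflet.js"></script>
 <script src="https://cdnjs.cloudflare.com/ajax/libs/leaflet.heat/0.2.0/leaflet-heat.js"></script>
</head>
<body>
  <div id="map_id" style="width:768px; height:1024px"></div>
  <script>
    var map = L.map('map_id').setView(__CENTER__, __ZOOM__);
    L.tileLayer('https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
      attribution: '&copy; <a href="https://www.openstreetmap.org/copyright">OpenStreetMap</a> contributors'
    }).addTo(map);
    L.heatLayer(__POINTS__, {radius: __RADIUS__}).addTo(map);
  </script>
</body>
</html>"""


def build_heatmap_html(points, center=(55, -5), zoom=6, radius=30, scale=1.0):
    """Return a Leaflet heatmap page for [lat, lng, value] points.

    `center` and `zoom` feed setView, `radius` feeds the heat layer and
    `scale` multiplies each plotted value.
    """
    scaled = [[lat, lng, value * scale] for lat, lng, value in points]
    return (
        PAGE_TEMPLATE
        .replace("__CENTER__", json.dumps(list(center)))
        .replace("__ZOOM__", str(int(zoom)))
        .replace("__RADIUS__", str(int(radius)))
        .replace("__POINTS__", json.dumps(scaled))
    )


# Made-up point: zoom in on one area and double the plotted value
html = build_heatmap_html([[51.5, -0.1, 8.5]], center=(51.5, -0.1), zoom=10, scale=2)
```

On Databricks the result could then be passed straight to displayHTML(html).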

That’s it! Now you have a heatmap in PySpark. Best of all, it’s interactive: you can pan and zoom it without needing to add anything further. A similar method could also be used with Pandas since it’s just a matter of formatting the data in the correct way for Leaflet to plot.
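For the Pandas case, a rough sketch of the same data preparation might look like the following. The small dataframe is made up purely for illustration (it is not the simplemaps data), and the column names simply mirror the ones used earlier.

```python
import json

import pandas as pd

# Made-up sample standing in for the cities data
pandas_df = pd.DataFrame({
    "city": ["Town A", "Town B", "Town C"],
    "lat": [51.5, 53.5, 52.0],
    "lng": [-0.1, -2.2, -1.0],
    "population": [8000000.0, 2500000.0, None],
})

# Same assumption as the Spark version: missing populations become 5000
pandas_df["population"] = pandas_df["population"].fillna(5000)

# Keep the three plotted columns, scale the value and convert to plain lists
points = (
    pandas_df[["lat", "lng", "population"]]
    .assign(population=lambda d: d["population"] / 1000)
    .to_numpy()
    .tolist()
)
data = json.dumps(points)
print(data)
```

Here data already contains the outer square brackets, so it would be inserted into the L.heatLayer call without wrapping it in another pair.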

Update: I have linked to a working version of the code on Databricks Community Edition which should be available for six months or so. I have also put the workbook on GitHub.

 
