Sometimes a situation will crop up where you want to access functionality in Databricks which is not readily accessible via Python. In these cases Databricks allows you to use shell scripting, which lets you execute bash shell scripts as part of your code. There are two ways to do this, each with its own strengths and weaknesses.

Shell scripting within a notebook
By starting a cell with %sh, it is possible to designate it as a shell script cell. You can then run a bash script within that cell. This method has the advantage of simplicity and flexibility. Its disadvantage is that the script only runs on the driver node and so cannot be used in distributed processes.
Why might we want to do such a thing? Well, here is just one example. By default, Databricks comes with a JDBC driver for accessing SQL databases; however, it does not come with an ODBC driver. If you want to connect to a database using ODBC (for example, using SQLAlchemy and PyODBC), you will need to run a shell script to install the ODBC driver first. Here is such a script:
%sh
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
sudo apt-get update
sudo ACCEPT_EULA=Y apt-get -q -y install msodbcsql17
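With the driver installed, the database can then be reached from Python. Below is a minimal sketch; the server, database and credential values are placeholders you would replace with your own:

import urllib.parse
import pyodbc
from sqlalchemy import create_engine

# Placeholder connection details - substitute your own server, database and credentials
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.example.com;"
    "DATABASE=mydatabase;"
    "UID=myuser;"
    "PWD=mypassword"
)

# Connect directly with PyODBC...
conn = pyodbc.connect(conn_str)

# ...or wrap the same connection string in a SQLAlchemy engine
engine = create_engine("mssql+pyodbc:///?odbc_connect=" + urllib.parse.quote_plus(conn_str))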
Shell scripting via an init script
Alternatively, you can run shell scripting via an init script. Before Databricks Runtime 11.3 this init script needed to be stored in cloud storage; from 11.3 onwards, scripts can be stored in the Databricks workspace. (There is also an even more recent option to store them in Unity Catalog volumes, if preferred.)
A key advantage of this method is that the script can load data or code onto every node in the cluster, which matters if it will be required by User Defined Functions (UDFs) or other processes that run on the workers. The main disadvantage is that it is less flexible, since the script must run at cluster start-up.
Why might we want to use this method? The NLTK library offers an example. While the main NLTK library is easily installed as a cluster library in the usual way, if we try to use its extensions within a UDF we run into problems. We can solve this using a cluster init script like:
#!/bin/bash
pip install nltk
Once this script is saved to the workspace, it can easily be referenced from the cluster's init script settings as follows.
- Open the cluster's advanced options
- Go to "Init Scripts" and add the file path to your new shell script in the workspace (setting the source to Workspace), e.g. Workspace/scripts/nltk_install.sh
- In the "Spark" tab of the cluster's advanced options, define an environment variable called NLTK_DATA="/scripts/nltk_data/"
- Restart the cluster
- Open a notebook and import nltk. (If you wish to check it has picked up the path correctly, run nltk.data.path.)
- Download the components you wish to be available. To download everything, use nltk.download('all', download_dir="/scripts/nltk_data/") (see the sketch after this list).
- Restart the cluster once more
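The notebook-side part of this process (importing NLTK, checking the data path and downloading the corpora) can be run in a single cell. Here is a minimal sketch using the path from the list above:

import nltk

# Confirm the NLTK_DATA environment variable has been picked up
print(nltk.data.path)

# Download the corpora and models to the location used above
nltk.download('all', download_dir="/scripts/nltk_data/")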
Once this is done you should have access to all of NLTK's data extensions on every node of the cluster; a quick way to check this is shown below. Other similar methods also exist for getting NLTK working across a cluster.
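The following is only an illustrative sketch of such a check: it assumes a running Databricks notebook (where the spark session is predefined) and uses made-up input data to confirm that NLTK and its tokenizer data are visible from within a UDF.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def tokenize(text):
    # Importing inside the function means the lookup happens on the worker, not just the driver
    import nltk
    return nltk.word_tokenize(text) if text else []

tokenize_udf = F.udf(tokenize, ArrayType(StringType()))

# Made-up example data
df = spark.createDataFrame([("Databricks makes shell scripting straightforward.",)], ["text"])
df.withColumn("tokens", tokenize_udf("text")).show(truncate=False)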