Getting jobs with bespoke libraries to run programmatically on Databricks

Previously I have scheduled PySpark jobs using Airflow, Papermill and the Databricks Jobs API. However, Databricks has a built-in scheduler for running automated jobs, which can be useful for simple automation tasks. You can find the documentation here: https://docs.databricks.com/jobs.html
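As a point of reference, the schedule that the built-in jobs system asks for boils down to a Quartz cron expression and a timezone, whether you fill it in through the UI or submit it through the Jobs API. Below is a minimal sketch of the API version; the workspace URL, token, notebook path and cluster settings are placeholders I have made up, so treat it as an illustration rather than a drop-in script.

import requests

# Placeholder workspace URL and personal access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "daily-batch",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/Jobs/daily_batch"},
    # Run every day at 06:00 UTC; Databricks schedules use Quartz cron syntax.
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": 123}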

I recently decided to use this built-in scheduler to simplify the workflow for a daily batch process. However, I ran across a small wrinkle that surprised me. I am providing the solution here in the hope it saves a few people some googling and puzzling over documentation.

The Problem

When you create a cluster in Databricks you can edit the cluster and add libraries via its Libraries tab. This is useful if you want to use unusual or bespoke libraries in your code.

Standard cluster creation with wide variety of library install options
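For comparison, the same breadth of options is available programmatically for an existing cluster through the Libraries API. This is only a sketch: the cluster id, package names, Maven coordinates and wheel path below are placeholders for whatever your bespoke code needs.

import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "cluster_id": "<cluster-id>",
    "libraries": [
        {"pypi": {"package": "great-expectations==0.13.4"}},  # from PyPI
        {"maven": {"coordinates": "com.databricks:spark-xml_2.12:0.10.0"}},  # from Maven
        {"whl": "dbfs:/FileStore/wheels/my_bespoke_lib-0.1.0-py3-none-any.whl"},  # bespoke wheel on DBFS
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()  # the install endpoint returns an empty body on success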

When you set up a job in Databricks you also define a cluster, but this cluster definition does not give you a Libraries tab. Some poking about leads you to the following documentation: https://kb.databricks.com/jobs/job-fails-no-library. It suggests that you instead declare the required libraries as dependent libraries in the job definition. That seems reasonable, but the job definition is far more restrictive about how you can install a library.

Job creation dependent library options are very limited
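For reference, when a job is defined through the Jobs API rather than the UI, dependent libraries live in a libraries list on the job spec itself. The field names below follow the Jobs API docs as I understand them, and the paths are placeholders, with the bespoke code assumed to be packaged as a wheel or jar uploaded to DBFS, so treat this as a sketch.

# This dict would be POSTed to /api/2.0/jobs/create exactly as in the earlier sketch;
# the cluster and schedule fields are unchanged and omitted here for brevity.
job_spec = {
    "name": "daily-batch-with-libs",
    "notebook_task": {"notebook_path": "/Jobs/daily_batch"},
    # Dependent libraries are declared on the job, not on the job's cluster definition.
    "libraries": [
        {"whl": "dbfs:/FileStore/wheels/my_bespoke_lib-0.1.0-py3-none-any.whl"},
        {"jar": "dbfs:/FileStore/jars/my-bespoke-lib-assembly.jar"},
    ],
}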

So how do you get jobs to use custom libraries from PyPI or Maven? You might think you can simply point the job at one of your previously defined interactive clusters, with their wider range of library options, but this is a bad idea for two reasons.

  1. These are interactive clusters, so they cost more to run than a scheduled job cluster should.
  2. The jobs documentation warns that, when running programmatic jobs, you cannot be certain the libraries will be available, since the code may start running before they have finished installing!
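If you do point a job at an interactive cluster anyway, about the best you can do is guard against that race at the top of your notebook. The snippet below is purely illustrative and uses a hypothetical package name; the cleaner fix follows in the next section.

import importlib
import time


def wait_for_library(module_name, timeout_s=300, poll_s=10):
    """Poll until module_name can be imported, or raise after timeout_s seconds."""
    deadline = time.time() + timeout_s
    while True:
        try:
            importlib.invalidate_caches()  # pick up packages installed after startup
            return importlib.import_module(module_name)
        except ImportError:
            if time.time() > deadline:
                raise
            time.sleep(poll_s)


# "my_bespoke_lib" is a hypothetical package name used for illustration.
my_bespoke_lib = wait_for_library("my_bespoke_lib")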

The Solution

Clearly we need to do something else. I discovered that the simplest solution is to install the libraries to the shared workspace, as described here: https://docs.databricks.com/libraries/workspace-libraries.html. You can then use the job setup page to add them as dependent libraries from the workspace. Dependent libraries always load before the job starts to run, so this avoids the availability issue noted above.

To install to the shared workspace, select Workspace in the sidebar, hover over Shared, then select Create > Library. The dialog that pops up offers the same options you get when adding libraries to an interactive cluster, including Maven and PyPI.

As a side benefit, it used to be possible to mark a workspace library to be installed automatically on all clusters, which prevented dependency issues entirely if desired. However, this no longer works for clusters on Databricks Runtime 7.0 and above.
