Recently Databricks made an exciting announcement. They have created a new library which allows you to use large language models to perform operations on PySpark dataframes. The library is called pyspark-ai and is available at https://github.com/databrickslabs/pyspark-ai/. I have tested the library out and while it still has some rough edges and can be a little brittle, it shows considerable promise.
Setup
So how do you get this to work on Databricks? The first requirement seems to be that your cluster is running Databricks Runtime 13.2. Using Runtime 12.2 LTS (which sadly seems to be the latest version available on the Databricks community cloud), attempting to import pyspark-ai into a notebook results in an error: `TypeError: dataclass_transform() got an unexpected keyword argument 'field_specifiers'`
You will also need an OpenAI API key from https://platform.openai.com/account/api-keys. This is not free, but my experiments with pyspark-ai so far have cost me only pennies, so it is economical. Once you have a key, use it to set an environment variable as follows:
OPENAI_API_KEY=your-key-goes-here
In theory this could just be set under cluster > configuration > advanced options > spark > environment variables. However, Databricks recommends that you set secrets via the Databricks CLI and the secrets API.
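Alternatively, you can wire this up from inside the notebook itself. Here is a minimal sketch; the scope and key names are placeholders for a secret you would have created via the CLI or the secrets API:

# a minimal sketch: read the key from a (hypothetical) Databricks secret scope
# and expose it as the environment variable that pyspark-ai expects
import os

os.environ["OPENAI_API_KEY"] = dbutils.secrets.get(
    scope="my-scope",       # placeholder scope name
    key="openai-api-key"    # placeholder key name
)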
So, having set up a nice shiny new cluster, the first thing you need to do is pip install pyspark-ai with:
pip install pyspark-ai
Once this is done, you can import the library, set up a SparkAI object, and enable AI for your notebook as follows:
from pyspark_ai import SparkAI

# langchain is auto installed as a dependency of pyspark-ai
from langchain.chat_models import ChatOpenAI

# If 'gpt-4' is unavailable, use 'gpt-3.5-turbo'
llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)

# Initialize SparkAI with the OpenAI model
spark_ai = SparkAI(llm=llm, verbose=True)
spark_ai.activate()
Now that things are set up, we can proceed to actually using AI to assist us in working with dataframes.
Displaying Data using AI
Displaying data can be achieved with a simple df.ai.plot() command. Optionally you can add a string describing what you would like plotted in order to guide the AI’s choices, or you can leave the prompt out and just see what it chooses to do. Under the hood it appears these plots are achieved by converting to pandas and using pandas plotting libraries.
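The examples below use a df_wine dataframe holding the wine quality sample data. As a sketch of how such a dataframe might be loaded (the path and separator here are assumptions based on the standard databricks-datasets samples):

# load the red wine quality sample data set
# (path and ";" separator are assumptions, adjust for your environment)
df_wine = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .option("sep", ";")
    .csv("/databricks-datasets/wine-quality/winequality-red.csv")
)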
# this lets the AI choose which columns to use for plotting
df_wine.ai.plot()

# while adding a prompt gives you extra control
df_wine.ai.plot('plot total sulphur dioxide against pH')
This can give you nice-looking plots with very little effort. Here is the default plot the AI gives from the wine data set which comes as standard in Databricks.
Transforming Data using AI
Transforming a dataframe is likewise simple to achieve, as follows:
df_averages = df_wine.ai.transform(
    """group by pH and calculate the average chlorides and sulphate
    for each group, then order the results by ascending pH"""
)
While the AI generally seems to do a good job of this transformation, it would certainly be wise to check its work. The good news is that SparkAI returns the SQL it used to transform your dataframe. This is useful for two reasons. Firstly, it makes checking the AI’s workings easier. Secondly, it means that you can extract that code for direct use, so you do not need to call the AI on subsequent runs of the notebook if you are creating something like an ETL pipeline.
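For instance, here is a sketch of reusing the generated query directly; the SQL below is illustrative and not necessarily what the AI actually produced for this dataframe:

# register the dataframe as a temp view and re-run the (hypothetical) SQL
# that SparkAI reported, avoiding any further OpenAI calls
df_wine.createOrReplaceTempView("wine")

df_averages = spark.sql("""
    SELECT pH,
           AVG(chlorides) AS avg_chlorides,
           AVG(sulphates) AS avg_sulphates
    FROM wine
    GROUP BY pH
    ORDER BY pH ASC
""")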
AI powered UDFs
SparkAI also gives you the ability to create custom User Defined Functions (UDFs) as follows. Here, rather than passing in requirements as parameters, they are expressed as a docstring inside the function.
# create a spark UDF
@spark_ai.udf
def convert_quality(quality: int):
    """Convert the quality to a descriptive word where 1 would be awful
    and 10 would be amazing"""
Note that convert_quality actually appears to be created as a plain function. In the above case the function created is as follows (once again SparkAI helpfully tells you what it has done):
# this code was what the AI generated
def convert_quality(quality) -> str:
    if quality is not None:
        if quality == 1:
            return "awful"
        elif quality == 2:
            return "bad"
        elif quality == 3:
            return "poor"
        elif quality == 4:
            return "fair"
        elif quality == 5:
            return "average"
        elif quality == 6:
            return "good"
        elif quality == 7:
            return "very good"
        elif quality == 8:
            return "excellent"
        elif quality == 9:
            return "outstanding"
        elif quality == 10:
            return "amazing"
The code can then be successfully used as a UDF as follows:
from pyspark.sql import functions as fn

# add a udf wrapper to the AI-created function
df_wine_descriptions = df_wine.withColumn(
    'description', fn.udf(convert_quality)('quality')
)
Once again the provided code can be extracted for analysis and for direct use to reduce OpenAI API calls (and thus cost).
Loading data from the web with AI assistance
Data can also be loaded from the web. This can either be done by providing the URL of a page on which a table you wish to use is located, or (if you provide a Google API key) by providing a search string for the AI to use to locate the data you want.
# if you have the URL for the data you can submit that
df_auto = spark_ai.create_df(
    "https://www.carpro.com/blog/full-year-2022-national-auto-sales-by-brand"
)

# alternatively use google to search for data
df_auto = spark_ai.create_df(
    "2022 USA national auto sales by brand"
)
Failure Cases and Overall Impressions
SparkAI is quite new and it does have some weaknesses and failure cases. These include:
- Some transforms seem to fail where column names have spaces in them. This is because the spaces lead to incorrect syntax in the SQL the AI generates.
- Some data loading from the web fails, even when you hint at which columns you are looking for. Two factors seem to be at play here: too much data for the AI response to handle, and difficulty identifying where the data sits on the webpage so that it can be extracted.
- The UDF creator does not actually seem to create a UDF. Rather, it creates a function which can be converted to a UDF by wrapping it in pyspark.sql.functions.udf().
The first two issues are likely model-dependent (I was using GPT-3.5) and may well be fixed by improved models over time or by better prompt engineering. The third issue is likely just an issue with how the UDF wrapper for the function is created, and there is a simple workaround as noted above.
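For the column-name issue, one possible workaround (my own sketch, not part of the library, and the column name in the prompt is just an example) is to strip spaces from the column names yourself before calling transform:

# hypothetical workaround: replace spaces in column names before using .ai.transform
df_clean = df_wine
for col_name in df_clean.columns:
    if ' ' in col_name:
        df_clean = df_clean.withColumnRenamed(col_name, col_name.replace(' ', '_'))

df_result = df_clean.ai.transform(
    "group by quality and calculate the average total_sulphur_dioxide for each group"
)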
Overall, this is an intriguing addition to PySpark, and if it continues to be developed it could turn into a very powerful data analysis tool.
Sample Notebook and Further Reading
It would appear that Databricks no longer supports publishing community notebooks directly, and the notebook does not currently work there anyway, so my example notebook can be found on GitHub. It details a number of successful uses along with some failure cases, and provides usage examples that may be handy for getting started, since some of the library's usage conventions are not immediately obvious.