Starting with Great Expectations in Pyspark Notebooks

Great Expectations (GE) is a pipeline testing tool for use with Pandas and Spark dataframes. It is useful for checking that the data flowing through your pipeline conforms to certain basic expectations, which helps ensure that your pipeline runs smoothly and does not suffer from garbage in, garbage out.

Great Expectations - a Dickens book and now a Python Library

To take full advantage of Great Expectations you need to set up a data context, connect data sources and perform other preparatory work. This allows GE to store metadata about your data. However, setting up a data context is not trivial. It would be nice if you could experiment with Great Expectations before taking the full plunge. It would also be nice to be able to run simple tests on the fly in notebooks.

Fortunately, Great Expectations does detail a way for you to start experimenting with Great Expectations and Pandas dataframes in a notebook here. Unfortunately, there is no similar quickstart guide for using GE with Spark dataframes in a notebook, so here is an explanation of how to do that.

Installation

First you will need to pip install great_expectations. Then, inside the notebook, import great_expectations as ge. Once you have done this, grab your data and load it into a Spark dataframe. Note that in Databricks you can install from within a notebook using dbutils.library.installPyPI("great_expectations").

Next you need to convert your dataframe into a Great Expectations object. For a Pandas dataframe this would be df_ge = ge.from_pandas(df); for a Spark dataframe it is df_ge = ge.dataset.SparkDFDataset(df).

Running Tests

Once you have done this you can run Great Expectations tests. A list of the available expectations can be found here (it's worth noting that GE also works with SQL tables). Further details on the methods are given here. Each expectation returns a dictionary of metadata, including a boolean "success" value.

One important thing to understand is that Great Expectations handles Pandas dataframes and Spark dataframes a little differently. With Pandas, the Great Expectations object is a PandasDataset, which inherits all of the Pandas methods while adding the Great Expectations methods. With Spark things are a little different: you cannot directly address the dataframe as part of the dataset. However, you can access the original dataframe using df_ge.spark_df.

Further Resources

You can find a very basic example notebook demonstrating how to get started with Great Expectations in a Databricks notebook here, or you can download it from GitHub if the Databricks community link has expired.

If your experiments with Great Expectations seem promising, you can go ahead and set up a data context to enable more advanced features like data profiling and automated validation. You can find details on how to instantiate a Data Context on Databricks to continue your GE journey here. Another good overview can be found here.
