When you ask Databricks to write a JSON file, you may be surprised by the results. Rather than a single, simple JSON file, you get a folder at the location you specified, containing some logging files and one or more JSON files with long, unpredictable names.
If you intend to read the data back in via Databricks at a later date, this is not a concern: you can simply read all of the JSON files in the folder back in by providing the same path (a quick sketch of this is shown below). However, if you want to read the JSON files elsewhere, Databricks' output format may be less useful. So how do we rectify this?
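For reference, reading the whole folder back really is that simple. Here is a minimal sketch, assuming a SparkSession named spark and that output_path (an illustrative variable name) is the path you originally passed to the write:

# read every part file in the folder back into a single DataFrame;
# Spark skips the underscore-prefixed metadata files automatically
df = spark.read.json(output_path)
df.show(5)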
Understanding the Issues
First things first, why do we get multiple JSON files? This is because PySpark executes your DataFrame operations in parallel across multiple partitions, and when it comes time to write, each partition is written out as a separate file. This is great for speed, but it may not be desirable for end users who want a single file.
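You can see the relationship for yourself. This is a minimal sketch (the output path /tmp/multi_file_example is purely illustrative, and df_out is the DataFrame being written) showing that the number of part files matches the number of partitions:

# check how many partitions the DataFrame currently has
print(df_out.rdd.getNumPartitions())

# write it out - Spark produces one part file per partition
df_out.write.format('json').mode('overwrite').save('/tmp/multi_file_example')

# list the output and count the JSON part files
part_files = [f.path for f in dbutils.fs.ls('/tmp/multi_file_example') if '.json' in f.path]
print(len(part_files))  # equals the partition count above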
Secondly, why do we get a folder and not a file? Well, this is Databricks trying to help us. If multiple JSON files are being written, they are much easier to locate when grouped in a folder, and the logging files can be useful if something goes wrong. However, we just want a single, simple JSON file, so how do we get that?
The Solution
Fortunately, we can restrict the output to a single file by coalescing the data into a single partition (so long as the data is not too big to fit in the memory of a single node). However, Databricks will still place this file in a folder along with the logs, which is not what we want. There is no way to tell PySpark not to do this, but we can fix it after the fact.
We can use the Databricks dbutils package to do a little file wrangling. First we get the filename of the written JSON file, then we move that file to where we want it and rename it, and finally we delete the unwanted folder with the logs. Example code showing how this is done is as follows:
temp_path = folder_path + file_name + '_temp'

# convert to single file and write out
df_out.coalesce(1).write.format('json').mode('overwrite').save(temp_path)

# Databricks writes a folder with metadata, so we need to reorganise
list_files = dbutils.fs.ls(temp_path)

# find the first (only) JSON file
json_file = [file.path for file in list_files if '.json' in file.path][0]

# move and rename it
dbutils.fs.mv(json_file, folder_path + file_name + '.json')

# delete everything else
dbutils.fs.rm(temp_path, True)
The good news is that PySpark can read individual JSON files, so doing this means the file is convenient for other uses but can still be read back into Databricks easily. When reading the data, remember that PySpark DataFrames are written to JSON as JSON Lines files, not standard JSON files. Also, if you wish to write gzipped files to save space, you can do so by simply adding an option to the code that writes out the file, and you will get a .gz file instead of a .json file:
temp_path = folder_path + file_name + '_temp'

# convert to single file and write out
df_out.coalesce(1).write.format('json').option('compression', 'gzip').mode('overwrite').save(temp_path)

# Databricks writes a folder with metadata, so we need to reorganise
list_files = dbutils.fs.ls(temp_path)

# find the first (only) gzipped JSON file
json_file = [file.path for file in list_files if '.json.gz' in file.path][0]

# move and rename it
dbutils.fs.mv(json_file, folder_path + file_name + '.json.gz')

# delete everything else
dbutils.fs.rm(temp_path, True)
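As a quick check, either file can be read straight back into a DataFrame. A minimal sketch, assuming the same folder_path and file_name variables as above:

# read the single JSON Lines file back into a DataFrame
df_check = spark.read.json(folder_path + file_name + '.json')

# the gzipped version works too - Spark decompresses .gz files automatically
df_check_gz = spark.read.json(folder_path + file_name + '.json.gz')

df_check.show(5)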