Using SQLAlchemy to Import Data to Pandas

Sometimes may want to use Python to extract data from a SQL database to analyse using pandas. There are a couple of issues here. Firstly you may want to retrieve data from more than one database table so you can’t simply work directly on the raw data from one table. Secondly the database may be very large and you want to work on a more reasonable subset of the data in pandas.

Database symbol with and arrow point to a picture of some pandas

 

To extract the data we need some way to submit queries to the SQL database and retrieve the table of results as a pandas dataframe. We may also want to be able to dynamically control the SQL query at runtime. Here is a quick run through of handy ways to do this using the SQLAlchemy library. Helpfully SQLAlchemy now supports MySQL as well.

The good news is that pandas implicitly understands and communicates with SQLalchemy. So first lets create a connection to our database using SQLAlchemy. This can be done like this:

# import the relevant python libraries

import pandas as pd
from sqlalchemy import create_engine

# set your parameters for the database
user = "user"
password = "password"
host = "abc.efg.hij.rds.amazonaws.com"
port = 3306
schema = "db_schema"

# connect to the database using sqlalchemy
conn_string = 'mysql+pymysql://{0}:{1}@{2}:{3}/{4}'.format(
user, password, host, port, schema
)
db = create_engine(conn_string)
conn = db.connect()

At the most basic level we can now send a static SQL query to our database :

# how to query the database using sql
# then return the results as a pandas dataframe

# define your database query
query = """
SELECT n.name, n.surname, a.address
FROM names AS n
INNER JOIN addresses AS a
WHERE n.id = a.name_id
"""

# access the database with the query and return a dataframe
df_names_and_addresses = pd.read_sql_query(query, conn)

What if we want to inject some parameters dynamically? No problem we can just use python’s .format command to modify the query on the fly:

# how to feed in a condition that you won't know until runtime

# assume these variables would be defined programatically at runtime
name = "John"
surname = "Smith"

# define your database query using {var} syntax for your unknowns
query = """
SELECT n.name, n.surname, a.address
FROM names AS n
INNER JOIN addresses AS a
WHERE n.id = a.name_id
AND n.name = '{name_tag}'
AND n.surname = '{surname_tag}'
"""

# now format the generic query with our runtime variables
query = query.format(name_tag = name, surname_tag = surname)

# access the database with this query and return it as a dataframe
df_name_and_address = pd.read_sql_query(query, conn)

What if we need update a list of values dynamically. Simple, just join our list into a comma delimited string and insert with .format as before:

# suppose you want to feed in a list that you won't know until runtime

# define our list for use in the sql query
user_id_list = [567, 678, 346]

# convert the list to be used to an SQL friendly comma delimited string
sql_friendly_list = ', '.join(map(str, user_id_list))

# define your database query using {var} syntax for your list
query = """
SELECT n.name, n.surname, a.address
FROM names AS n
INNER JOIN membership AS m
WHERE n.id = a.name_id
AND n.id IN ({id_list})
"""

# now format the generic query with our list
query = query.format(id_list = sql_friendly_list)

# access the database with this query and return it as a dataframe
df_name_and_address_by_id = pd.read_sql_query(query, conn)

We can also convert information like python datetimes so that they can be used in the query. For datetimes the trick is to use .strftime:

# feeding a date from python's datetime to an sql query
import datetime

# take a standard python datetime
date = datetime.datetime.now()
# convert it to sql readable format
sql_date = date.strftime("%Y%m%d")

# define your database query using {var} syntax for your date
query = """
SELECT n.name, n.surname, m.join_date, m.membership_number
FROM names AS n
INNER JOIN membership AS m
WHERE n.id = m.name_id
AND m.join_date = '{sql_date_tag}'
"""

# now format the generic query with our date
query = query.format(sql_date_tag = sql_date)

# access the database with this query and return it as a dataframe
df_memberships_for_date = pd.read_sql_query(query, conn)

If you are more familiar with SQL than with pandas there is more good news. Pydata have thoughtfully provided a handy guide to how to convert SQL queries into pandas terms. So have Hackernoon and Medium.

On the other hand if your specialities are Python and pandas then you can use those cheat sheets in reverse to work out how to write the SQL queries to feed to SQLAlchemy to get your data into pandas where you are more at home.

The next blog post will examine how we get data we have processed in pandas back into a database.

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.