Image Recognition 2 of 4 – Using Beautiful Soup to Extract Webpage Information for a Data Set

The first step in training a machine learning algorithm is to acquire a suitable data set to train, validate and test the model. As we wish to categorise LEGO sets by type, we will need a large number of images of different LEGO sets. In addition we will need consistent “tags” to tell us (and more importantly our machine learning system) what each image contains.

So first, let's take a look at what is available online as possible resources for this task. A quick Google search for “Lego Sets Categories” turns up a number of possibilities.

A closer look at these rapidly rules out Wikipedia (no images) and Brick Instructions (poor metadata). The official Lego site is well categorised, but the Brickset site has the best metadata:

Illustrative image from https://brickset.com/sets/theme-Bionicle (pretty definitely a Bionicle)

Note the set number, set type, year and additional tags just under the title. These will be ideal for labelling our data to train our machine learning algorithm, if we can just extract them.

An excellent article has been written about using Python and Scrapy to extract information from Brickset. (Update 2019: another excellent article on setting up and using Scrapy can be found on Like Geeks.) However, I am going to use Beautiful Soup with Python so that you can compare and contrast the two web scraping methods and make your own decision about which you prefer. Files for this project can be found on GitHub.

I conducted this project using Python 3, Anaconda, Jupyter Notebooks and a Chrome web browser on a Windows 10 machine, but many other setups would work just as well. Let's start by importing urllib to handle downloading from the internet, os to handle saving the data we scrape to disk, and BeautifulSoup to help transfer data in a meaningful fashion from one to the other. Later on we will also be using pickle to back up data, so you may want that installed as well.


# import the libraries we will need

import urllib.request
from bs4 import BeautifulSoup as soup
import os

# local folder to save our scraped data into
path = "C:/Users/Justin/Pictures/Lego"
# the browse page we will spider from (it links to every theme listing)
baseURL = "https://brickset.com/browse/sets"

# initialise the program path
os.chdir(path)

OK, now we have the dependencies and paths set up, let's take a look at the structure of the Brickset pages. Taking a little look around, we find a super useful page at https://brickset.com/browse/sets. This page seems like a great jumping-off point from which to spider the site, as it has explicit links to the top pages of all the category listings. Let's download this page using urllib.

#create a list to hold our return information
soup_list = []

# open with urllib (note this form is Python 3 specific)
req = urllib.request.Request(baseURL)
opened = urllib.request.urlopen(req)
page_HTML = opened.read()
opened.close()
# convert HTML to a soup object for parsing
soup_list += [soup(page_HTML, "html.parser")]

print (len(soup_list))

This code should return 1 as we have downloaded one page from the web. Now we can inspect the contents of the web page using commands like:


# we can examine the text
print (soup_list[0].text)
# we can examine the html
print (soup_list[0].prettify())

However, if we want to parse this page to extract the category links, it may be easier to inspect the webpage source interactively in Chrome or another browser which allows this. In Chrome, right click on a link (1) and select Inspect (2) from the drop-down menu:

Examining the html tags around this allows us to work out how to isolate and extract the category links we will require for the next step. In particular, looking in the source inspection panel, we note that the links are contained within anchor tags.
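
If you prefer to stay in the notebook, a quick way to see the same structure (a rough sketch; the exact links will depend on the current state of the page) is to print the href of the first few anchor tags:

# peek at the href attribute of the first few anchor tags on the page
for link in soup_list[0].find_all('a')[:10]:
    print(link.get('href'))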

We may also want to restrict which links we extract in other ways; for instance, we may want to ignore the year-specific categories at the top of the page. This can be done by cycling over all the links we extract from anchor tags and applying a further set of criteria: one to make sure it is a theme link and one to make sure it is not year specific. We'll need to inspect a few links and look for common patterns. This gives us code like this:


# extract themes
theme_links = []
# find link blocks
for link in soup_list[0].find_all('a'):
    # extract the relative link (get returns None if the anchor has no href)
    link_text = link.get('href')
    # reject extraneous links on the page
    # (non theme links and year-specific theme links)
    if link_text and (link_text[:11] == "/sets/theme") and (link_text[-9:-5] != "year"):
        # store the full URL not just the relative link
        theme_links += ["https://brickset.com"+link_text]

# check that the output is as desired
print(len(theme_links))
print(theme_links[0:10])

Note the checks at the bottom, which allow us to confirm that we are extracting what we expect; if we aren't, the criteria we filter by can be adjusted.

We are now in a position to extract the first page from each category using our list of URLs.


#create a list to hold our return information
theme_soup_list = []

# open and copy the desired pages
# can insert numbers into [0:] during testing eg [0:5]
for theme in theme_links[0:]: 
    # open with urllib (note this differs between Python 2 and 3)
    req = urllib.request.Request(theme)
    opened = urllib.request.urlopen(req)
    page_HTML = opened.read()
    opened.close()
    # convert HTML to a soup object for parsing
    theme_soup_list += [soup(page_HTML, "html.parser")]

print (len(theme_soup_list))

Looking at one of these pages ourselves, we can see that it shows only the first 25 entries, but it helpfully states how many items there are in that category in total.

So we do some sums, look at the general pattern of the URLs, and create a list of every URL from which we want to download data:


# each page in theme_soup_list only shows the first 25 matches
# let's find out how many results there are in each theme and build a more complete URL list
complete_URL_list = []
for i in range(len(theme_soup_list)): # use an index so we can also iterate over theme_links
    # extract the number of matches in the current group
    matches_text = (theme_soup_list[i].find('div', class_='results').text)
    matches_list = matches_text.split()
    matches = int(matches_list[4])
    # pages needs to be two higher than matches/25 to allow for range() stopping early and for a partial final page
    pages = (int(matches/25)+2)
    # now let's create a new, more complete list of pages to spider
    complete_URL_list += [theme_links[i]]
    for j in range(2,pages):
        complete_URL_list += [theme_links[i]+"/page-" +str(j)]

print (len(complete_URL_list))
print (complete_URL_list[0:10])
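
To sanity check the paging arithmetic, here is what it gives for a hypothetical theme containing 83 sets (the variable names here are purely illustrative):

# paging arithmetic for a hypothetical theme containing 83 sets
example_matches = 83
example_pages = (int(example_matches/25)+2)   # = 5
print(list(range(2, example_pages)))          # [2, 3, 4]
# the base theme page plus pages 2-4 give 4 * 25 = 100 listing slots, enough for all 83 sets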

Once this is done we are in a position to scrape the data we want. We shall do this in two stages. First we shall scrape the metadata and the links to the related images we want to download but not the images themselves.


# roll up extraction of thumbnail address, main link, name and metadata list into one loop

def getSoup(targetURL):
    req = urllib.request.Request(targetURL)
    opened = urllib.request.urlopen(req)
    page_HTML = opened.read()
    opened.close()
    # convert HTML to a soup object for parsing
    return soup(page_HTML, "html.parser")

def processSoup(soup_in):
    processed_tuple_list = []
    matches = (soup_in.find_all('article', class_='set'))
    #print(len(matches))
    for match in matches[0:]:
        # get the image url for the set
        set_thumb_URL = match.find('img')['src']
        # get the title and split off the set name
        title = match.find('img')['title']
        set_name = title.split(':')[1].strip()
        # get the tags for the set
        tag_soups = match.find_all('div', class_='tags')
        primary_tags = tag_soups[0].text.strip().split()
        set_number = primary_tags[0] # first primary tag gives more detail than the split of the name
        set_type = primary_tags[1]
        year = primary_tags[-1] # always the last primary tag
        secondary_tags = tag_soups[-1].text.strip().split(" ")
        # make a tuple of the set_number, set_name, set_type, set_thumb_URL, year, secondary_tags
        item_tuple = (set_number, set_name, set_type, set_thumb_URL, year, secondary_tags)
        processed_tuple_list += [item_tuple]
    return processed_tuple_list

# cycle over all the scraped pages, but do not keep the soups (that would use a lot of memory)
# instead, iterate over them and process as you go
lego_set_tuple_list = []
for page_URL in complete_URL_list[0:]:
    soup_to_process = getSoup(page_URL)
    lego_set_tuple_list += processSoup(soup_to_process)

# check that the output is as desired
print (len(lego_set_tuple_list))
print (lego_set_tuple_list[5:10])

This crawl takes some time and we do not want to lose the data, so let's save it to our computer with pickle. This also allows you to scrape the images at a later date if desired (though leave it too long and you may of course find some data has gone missing).


# we can use pickle to save a more readily python readable form of our data
import pickle

path = "C:/Users/Justin/Pictures/Lego/"
os.chdir(path)

with open('LegoData2.pkl', 'wb') as f: # Python 3: open(..., 'wb')
    pickle.dump(lego_set_tuple_list, f)

Now we can load our list of metadata back in and work on it, safe in the knowledge that we have a backup.


# Getting back the objects:
path = "C:/Users/Justin/Pictures/Lego/"
os.chdir(path)
with open('LegoData2.pkl', 'rb') as f: # Python 3: open(..., 'rb')
    spare_copy = pickle.load(f)

print (len(spare_copy))
print (spare_copy[30:35])

At this point you could also check for duplicates and anomalies in the data and eliminate these issues before downloading images. A simple way to check for duplicates would be as follows:


# check for duplicates by counting unique set numbers
list_spare = [i[0] for i in spare_copy]
set_spare = set(list_spare)
print(len(set_spare))
# acceptably small number of duplicates (about 50 out of 15000, roughly 0.3%)
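
If you also want to see which set numbers are duplicated, rather than just how many, a small sketch using collections.Counter from the standard library (not otherwise used in this project) will list them:

# list the set numbers that appear more than once
from collections import Counter

duplicate_counts = Counter(list_spare)
duplicates = [set_number for set_number, count in duplicate_counts.items() if count > 1]
print(len(duplicates))
print(duplicates[:10])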

We can also remove items from the list that don’t point at a valid jpg or are duplicates, as follows:

# create a cleansed set list
intermediate_set_list = []
cleansed_set_list = []
# first remove any lines where the URL does not point at a .jpg
for lego_set in lego_set_tuple_list[0:]:
    #print (lego_set[3][-17:-13])
    if lego_set[3][-17:-13] == '.jpg':
        intermediate_set_list += [lego_set]

print (len(intermediate_set_list))

# remove duplicate entries
for lego_set in intermediate_set_list:
    if lego_set[0] in set_spare:
        set_spare.remove(lego_set[0])
        cleansed_set_list += [lego_set]

print (len(cleansed_set_list))

Once we have our list of metadata and image URLs, we shall use the image URLs to download the related pictures. I selected the thumbnail URLs because we are expecting a lot of data items to be returned and want to keep the dataset to a manageable size, reduce download times and avoid overloading the Brickset site.


# download the thumbnails
path = "C:/Users/Justin/Pictures/Lego/thumbnails/"
os.chdir(path)
for lego_set in cleansed_set_list[0:]: # insert a number after the colon (eg [0:5]) during testing to shorten run times
    # open and name the file
    imagefile = open(lego_set[0] + "_" + lego_set[2] + "_" + lego_set[4] + '.jpg', "wb")
    # open the url and write to the file
    imagefile.write(urllib.request.urlopen(lego_set[3]).read())
    # close the file
    imagefile.close()

Note that this procedure does not check whether the jpg is corrupt or has any other issues; it just downloads it blindly to your local machine. Success! We now have a downloaded data set of LEGO images and a list of metadata.
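
If you want a little more confidence in what has landed on disk, a minimal sketch along the following lines (assuming the Pillow library is installed, which this project does not otherwise use) will flag any files that cannot be opened as images:

# optional: check that each downloaded file can actually be opened as an image
from PIL import Image

bad_files = []
for filename in os.listdir(path):
    try:
        with Image.open(os.path.join(path, filename)) as img:
            img.verify()  # raises an exception if the file is broken
    except Exception:
        bad_files += [filename]

print(len(bad_files), "problem files")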

Finally, let's save the cleaned-up, deduplicated metadata list to file for use in the next step:


# finally lets pickle the cleaned data
import pickle

path = "C:/Users/Justin/Pictures/Lego/"
os.chdir(path)

with open('LegoDataClean.pkl', 'wb') as f: # Python 3 specific
    pickle.dump(cleansed_set_list, f)

In the next section we will select a machine learning algorithm and prepare our downloaded images for training.

Closing notes, hints and tips

Hints for Beautiful Soup:

  • Be sure to make full use of the html tag structure of the page to specify what you want to extract.
  • Don’t be scared to combine Beautiful Soup’s parsing with more conventional parsing such as string operations or even regular expressions (see the sketch after this list).
  • Print samples of your output often to check you are getting what you plan to extract.
  • .find and .find_all are your biggest friends.
  • The tags returned by finds can be indexed like dictionaries to read their attributes (eg set_thumb_URL = match.find('img')['src']); you can also use .get.
  • Use Chrome or a similar browser to inspect the source side by side with writing your parser.
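
As an example of mixing the two approaches, here is a rough sketch that pulls the set number and name out of an image title with a regular expression instead of split; the title string is a hypothetical example of the "<set number>: <set name>" pattern that processSoup relies on:

import re

# extract the set number and name from a title of the form "<set number>: <set name>"
title = "10295-1: Porsche 911"  # hypothetical example title
match = re.match(r"(\S+):\s*(.+)", title)
if match:
    set_number, set_name = match.group(1), match.group(2)
    print(set_number, "|", set_name)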

General Hints:

When preparing to scrape large quantities of data off a site, start with small subsets to test your methodology. One of the simplest ways to do this is to use a small list subset in loops:


for theme in theme_links[0:10]:

Then, when you are ready to complete a full run, simply remove the number after the colon to process the whole list through the loop:


for theme in theme_links[0:]:

This kind of coding is quite exploratory; as such, you may find it helpful to write a list of comments detailing the steps you are trying to achieve and then use these as a guide to build your code. Added advantage: commented code!
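
For example, the scrape above might start life as nothing more than a comment skeleton (the steps below are purely illustrative):

# plan the scrape as comments first, then fill in code under each step
# 1. download the category index page
# 2. extract the theme links from the anchor tags
# 3. work out how many pages each theme has
# 4. scrape the metadata and thumbnail URLs from every page
# 5. download the thumbnails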

Cheeky urllib workaround:

Not all sites like the default urllib user agent. In this case we can tell them that we are a different agent:


# declare a user agent to dissuade the site from bouncing us
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
headers = { 'User-Agent' : user_agent }

# open the desired page with the header applied
# (this form is Python 3 specific)
req = urllib.request.Request(baseURL, None, headers)
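
If the whole crawl needs this header, one option (a sketch only; the original run did not do this) is to fold it into a variant of the getSoup helper, here given the hypothetical name getSoupWithHeaders:

# a version of getSoup that always sends the custom User-Agent header
def getSoupWithHeaders(targetURL):
    req = urllib.request.Request(targetURL, None, headers)
    opened = urllib.request.urlopen(req)
    page_HTML = opened.read()
    opened.close()
    # convert HTML to a soup object for parsing
    return soup(page_HTML, "html.parser")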

2 Replies to “Image Recognition 2 of 4 – Using Beautiful Soup to Extract Webpage Information for a Data Set”

  1. Great article, but totally unnecessary in the case of extracting data from Brickset because there’s a SOAP API and even a CSV download that provides all the data 🙂

    Huw
    Owner, Brickset.com

    1. Excellent point Huw. However the article was designed to demonstrate the use of Beautiful Soup as part of a machine learning pipeline rather than to go about things in the most efficient way possible. Realistically if you have a picture of some Lego, looking on Brickset directly is likely to identify it faster and more accurately than building a machine learning algorithm like this one.
