Before data can be used for machine learning, you need to ensure that it is correctly prepared. This can involve a number of steps including:
- Selection of algorithm
- Cleaning the data of errors and duplicates
- Data augmentation
- Formatting the data correctly for the algorithm
The choice of algorithm will play a large part in determining what steps need to be taken. Since even our largest category contains only a few hundred images, we will need to be smart in our selection of algorithm. Ideally, rather than training from scratch, we would use transfer learning to get the most predictive value out of our dataset without needing to resort to heroic levels of data augmentation or further data collection. Transfer learning works by taking several layers from a pretrained, high-quality machine learning model and joining them to a few new layers leading to the output layer. The pre-existing network extracts features, and we only need to train our rather simple “add-on” layers on our images in order to get good predictions.
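To make the principle concrete before reaching for Keras, here is a toy, framework-free sketch (the numbers and the tiny linear "network" are illustrative only, not VGG16): a frozen "pretrained" base maps inputs to features, and gradient descent updates only the small add-on head.

```python
# frozen "pretrained" base: stands in for the convolutional layers we reuse
BASE_W = [[0.5, -0.2, 0.1, 0.3],
          [0.1, 0.4, -0.3, 0.2]]

def base_features(x):
    # these weights are never updated: the base is frozen
    return [sum(w * xi for w, xi in zip(row, x)) for row in BASE_W]

# trainable "add-on" head: a single linear output unit
head_w = [0.0, 0.0]

def predict(x):
    feats = base_features(x)
    return sum(w * f for w, f in zip(head_w, feats))

# train only the head with plain gradient descent on a toy target
data = [([1, 0, 0, 0], 1.0), ([0, 1, 0, 0], -1.0)]
for _ in range(500):
    for x, y in data:
        feats = base_features(x)
        err = predict(x) - y
        for i in range(len(head_w)):
            head_w[i] -= 0.1 * err * feats[i]

# the head has learned the task; BASE_W was never touched
print(round(predict([1, 0, 0, 0]), 3), round(predict([0, 1, 0, 0]), 3))
```

The real version is the same idea at scale: VGG16's convolutional stack plays the part of `base_features`, and only the new dense layers are trained.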

A quick search online turns up an excellent candidate for our problem. One possibility would be a solution based around ResNet-18; however, Keras has a built-in pretrained model called VGG16 which has been used on similar problems in the past. Excellent examples include both binary-class and multi-class implementations. We shall pattern our solution after the multi-class implementation, which has a useful code library on GitHub.
In the case of the LEGO image data, we already eliminated the duplicates and bad data during the web scraping process detailed in part 2. This is generally preferable: while identifying duplicate data can be reasonably simple, checking data for correct tags and freedom from other errors can be difficult. Putting in the thought to acquire a good-quality dataset in the first place will generally work out more efficient in the long run.
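As an aside, if you ever do need to screen a folder of downloaded images for exact duplicates after the fact, hashing the file contents is a simple approach. A sketch (demonstrated here on throwaway files in a temporary directory rather than real images):

```python
import hashlib
import os
import tempfile

def find_duplicates(folder):
    """Group files by the MD5 hash of their bytes; any group with
    more than one member is a set of exact duplicates."""
    seen = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            seen.setdefault(digest, []).append(name)
    return [names for names in seen.values() if len(names) > 1]

# demonstrate on three throwaway files, two of which are identical
tmp = tempfile.mkdtemp()
for name, data in [('a.jpg', b'image-1'), ('b.jpg', b'image-2'), ('c.jpg', b'image-1')]:
    with open(os.path.join(tmp, name), 'wb') as f:
        f.write(data)

print(find_duplicates(tmp))  # [['a.jpg', 'c.jpg']]
```

Note this only catches byte-for-byte duplicates; near-duplicates (resized or re-encoded copies) need perceptual hashing instead.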
Data augmentation can be necessary where the dataset is insufficiently large. For image data this generally consists of applying a variety of modifications such as noise, crops, rotations and stretches to duplicates of the training images in order to create additional training examples. However, as we are planning to use transfer learning, our small sample sizes should not be such a difficulty in this instance, so for now we will not apply data augmentation to our dataset directly; we are indirectly leveraging a similar principle through transfer learning.
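For completeness, here is roughly what basic augmentation might look like with Pillow. This is a sketch on a synthetic solid-colour image; in practice you would apply it to copies of real training images, and the rotation/crop ranges shown are arbitrary choices:

```python
import random
from PIL import Image

random.seed(1)

# a synthetic 256 x 256 stand-in for a training image
original = Image.new('RGB', (256, 256), color=(200, 30, 30))

def augment(im):
    """Return a randomly rotated and cropped 224 x 224 copy."""
    rotated = im.rotate(random.uniform(-10, 10))
    # random 224 x 224 crop from within the 256 x 256 frame
    left = random.randint(0, im.width - 224)
    top = random.randint(0, im.height - 224)
    return rotated.crop((left, top, left + 224, top + 224))

# generate five augmented variants from the one source image
augmented = [augment(original) for _ in range(5)]
print([im.size for im in augmented])
```

This also shows why we will resize to 256 x 256 rather than 224 x 224 below: the spare border gives the random crops and rotations room to work without introducing blank edges.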
Finally we need to format the data correctly for the algorithm. Looking over the implementation of the VGG16 network we notice a number of important requirements:
- Image size of 224 x 224
- Image locations must follow a particular folder structure
- Data labels must be provided via the folder names
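The last two requirements come from the way Keras reads images from disk: one subfolder per class under each split, with the folder name serving as the label. A minimal sketch of the target layout, built in a temporary directory with placeholder category names:

```python
import os
import tempfile

base = tempfile.mkdtemp()

# one subfolder per class under each split; the folder name is the label
for split in ['train', 'validation', 'test']:
    for category in ['Duplo', 'Technic']:  # placeholder categories
        os.makedirs(os.path.join(base, 'data', split, category), exist_ok=True)

# Keras infers the class labels from the subfolder names under each split
print(sorted(os.listdir(os.path.join(base, 'data', 'train'))))  # ['Duplo', 'Technic']
```

We will build exactly this structure for the real categories further down.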
Let's address these issues using some Python. First, let's resize the images to 256 x 256 using Pillow. We could leave them as they are and let Keras take care of the resizing, but this is a useful way to dip into Pillow, a modern successor to the old Python Imaging Library which may be of use on other projects. Note that we don't rescale straight to 224 x 224: if we later choose to use data augmentation techniques, the extra border will allow for minor crops, rotations and other distortions. This will require some libraries.
#!/usr/bin/python
import os
import sys
import random
from PIL import Image  # used for image resizing
import pickle          # used for retrieving saved data
import shutil          # used for rapid image copying
import collections     # used for checking category sizes
Note that while you install the package as pillow (conda install -c conda-forge pillow or pip install pillow), when you import it in Python the name is PIL, in common with the old PIL library. Now on with the resize.
# resize the images in a specified path and write them back to a different path
# this is slightly larger than the 224x224 dimensions needed for VGG16
# this allows for data augmentation later if required
width = 256
height = 256
in_path = "C:/Users/Justin/Pictures/Lego/thumbnails/"
out_path = "C:/Users/Justin/Pictures/Lego/preprocessed/"
os.chdir(in_path)
contents = os.listdir(in_path)
def resize():
    counter = 0
    for item in contents:
        # check that the file in question is a file, not a folder
        if os.path.isfile(in_path + item):
            counter += 1
            im = Image.open(in_path + item)
            # LANCZOS was named ANTIALIAS in older versions of Pillow
            imResize = im.resize((width, height), Image.LANCZOS)
            imResize.save(out_path + item, 'JPEG', quality=90)
    # tell us how many images we managed to resize
    print (counter)

resize()
We will also need the metadata and labels that we stored for the images at the end of part 2.
# Getting back the metadata for the images
path = "C:/Users/Justin/Pictures/Lego/"
os.chdir(path)
with open('LegoDataClean.pkl', 'rb') as f:  # Python 3: open(..., 'rb')
    imported_dataset = pickle.load(f)
# check data is what we expect
print (len(imported_dataset))
print (imported_dataset[30:35])
Now that we have the data we can check the sizes of our categories. Some of the classes may prove to be very small, so we will look for the 20 biggest categories for our project.
# create a dictionary of categories and number of members
categories_extracted = [x[2] for x in imported_dataset]
counted_categories = collections.Counter(categories_extracted)
print (counted_categories)
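Rather than reading the 20 biggest categories off the printout by eye, Counter's most_common method returns them directly. A toy example (with made-up counts and most_common(3) standing in for most_common(20)):

```python
import collections

# toy stand-in for categories_extracted
toy_categories = ['City'] * 5 + ['Technic'] * 3 + ['Duplo'] * 2 + ['Fabuland']
counted = collections.Counter(toy_categories)

# the n most frequent categories, largest first
top = counted.most_common(3)
print(top)        # [('City', 5), ('Technic', 3), ('Duplo', 2)]

# just the names, for use as a category whitelist
top_names = [name for name, count in top]
print(top_names)  # ['City', 'Technic', 'Duplo']
```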
Next we need to randomise our data to ensure that our training, validation and test images are drawn from the same distribution. Then we need to exclude from the dataset all the categories that we do not intend to use.
# shuffle up our examples to ensure that train, validation and test sets have similar distributions
random.shuffle(imported_dataset)
# then sort by set type
set_type_dataset = sorted(imported_dataset, key=lambda x: x[2])
# create a new training set list which only contains items in a subsetted list
limited_list = ['Duplo', 'Star', 'City', 'Creator', 'Bionicle', 'Ninjago', 'Town',
'Racers', 'Technic', 'Castle', 'System', 'LEGOLAND', 'Space', 'Sports',
'Explore', 'Trains', 'Fabuland', 'HERO', 'Marvel', 'DC']
#create a new list which only contains entries from our chosen categories
limited_dataset = []
for row in set_type_dataset:
    if row[2] in limited_list:
        limited_dataset += [row]
# check our new dataset looks plausible
print (limited_dataset[30:35])
print (len(limited_dataset))
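The shuffle-then-sort step deserves a word of explanation: sorting is stable, so the examples end up grouped by category but in random order within each group, which means a later positional 8:1:1 split samples every category evenly. A toy demonstration of that split logic on made-up (id, category) rows:

```python
import random

random.seed(42)

# toy dataset of (id, category) rows: 10 per category
rows = [(i, cat) for cat in ['City', 'Duplo'] for i in range(10)]
random.shuffle(rows)
rows.sort(key=lambda r: r[1])  # regroup by category, random order within

# positional 8:1:1 split by row number
splits = {'validation': [], 'test': [], 'train': []}
for n, row in enumerate(rows):
    if n % 10 == 0:
        splits['validation'].append(row)
    elif (n - 1) % 10 == 0:
        splits['test'].append(row)
    else:
        splits['train'].append(row)

print({k: len(v) for k, v in splits.items()})  # {'validation': 2, 'test': 2, 'train': 16}
```

Because each category occupies a contiguous block of rows, each block of ten contributes one validation image, one test image and eight training images; we apply the same idea to the real image files below.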
Next we set up the base folders for our train, validation and test examples. Once these are in place we can use a script to create the relevant category subfolders in each folder, as follows:
# set up our keras folders
# specify base directories
source_directory = "C:/Users/Justin/Pictures/Lego/preprocessed/"
train_directory = "C:/Users/Justin/Pictures/Lego/data/train/"
validation_directory = "C:/Users/Justin/Pictures/Lego/data/validation/"
test_directory = "C:/Users/Justin/Pictures/Lego/data/test/"
# need to extract the categories to set up appropriate folders in train, test and validate
categories = []
for row in limited_dataset:
    categories += [row[2]]
categories = list(set(categories))
# check our list is correct
print (categories)
for category in categories:
    os.makedirs(os.path.dirname(train_directory + category + "/"), exist_ok=True)
    os.makedirs(os.path.dirname(test_directory + category + "/"), exist_ok=True)
    os.makedirs(os.path.dirname(validation_directory + category + "/"), exist_ok=True)
Now all we need to do is copy the appropriate images into the relevant category subfolders of the train, validation and test sets.
# load the various classes of images and place them in train, validation
# and test folders in an 8:1:1 ratio
import shutil

row_number = 0

def assign_image(image_name, source, target):
    shutil.copyfile(source + image_name, target + image_name)
    #image = Image.open(source_directory+image_name)
    #image.save(target_directory + image_name, 'JPEG', quality=90)

# cycle over dataset rows
for row in limited_dataset:
    # assign 1/10 to validation, 1/10 to test and the rest to train
    if row_number % 10 == 0:
        # note generators for filename AND also for destination directory
        assign_image(row[0] + "_" + row[2] + "_" + row[4] + '.jpg', source_directory, validation_directory + row[2] + "/")
    elif (row_number - 1) % 10 == 0:
        assign_image(row[0] + "_" + row[2] + "_" + row[4] + '.jpg', source_directory, test_directory + row[2] + "/")
    else:
        assign_image(row[0] + "_" + row[2] + "_" + row[4] + '.jpg', source_directory, train_directory + row[2] + "/")
    row_number += 1
The above code is also available on GitHub.
Now we are ready to start using the images we scraped to train a model.