This project is made by Anish Govind. Other projects can be found at my GitHub. Huge thank you to the contributors behind this dataset as without it this project wouldn't be published
The purpose of this project is to classify between 10 species of butterflies based of a picture of a butterfly. Butterflies are important and crucial to our ecosystem. They are vital in pollination, as they spread pollen when they collected nectar, and the ability to classify between them would lead to knowing which butterflies are more useful in pollination.
These are the supported species by this dataset and model:
| Scientific Name | Common Name |
|---|---|
| Heliconius Erato | Crimson-patched Longwing |
| Heliconius charitonius | Zebra Longwing |
| Danaus plexippus | Monarch |
| Lycaena phlaeas | American Copper |
| Junonia coenia | Common Buckeye |
| Vanessa cardui | Painted Lady |
| Nymphalis antiopa | Mourning Cloak |
| Papilio cresphontes | Giant Swallowtail |
| Vanessa atalanta | Red Admiral |
| Pieris rapae | Cabbage White |
Any other species are not supported by this dataset and therefore this model! Running the model based off another species will not produce an error, but it will provide an incorrect prediction!
- Inferential Statistics
- Deep Learning
- Convolutional Neural Networks
- Python
- Pandas
- Jupyter & Matplotlib
- Numpy
- Sci-Kit Learn (for data handling)
- Keras & Tensorflow
This project was developed off the data (provided here) which consisted
of 10 text files and ~800 images of butterflies. The data is marked in the way that the file name is the first 3 digits of the
image file (a butterfly from 001.txt would have image 0010001.png). Armed with this, a couple of iterative sequences is able
to populate a dictionary with 2 pieces of information: Scientific Name and the Image Array.
Initially, the dictionary was structured as such:
my_dict = {
"Names": ["Monarch", "Crimson-patched Longwing", ...],
"Scientific Names": ["Danaus plexippus", "Heliconius erato", ...],
"Image Files": [["0010001.png", ...], ...], # List of lists of image files that relate to the iterated species
"Image Arrays": [
[
[[[...]]], ... # 3-D Image matrices from matplotlib.image.imread. Resized to be 128x128.
],
[
[[[...]]], ... # 3-D Image matrices from matplotlib.image.imread. Resized to be 128x128.
],
...
]
}And would create a DataFrame similar to the following:
| Names | Scientific Names | Image Files | Image Arrays |
|---|---|---|---|
| "Monarch" | "Danaus plexippus" | ["0010001.png", ...] | [[[...]]] |
| "Crimson Patched Longwing" | "Heloconius erato" | ... | [[[...]]] |
| ... | ... | ... | ... |
This would be good, for normal database handling, but in terms of running models off this data, it was a terrible way to do this. Instead, we should try to have each input have an output, instead of having one output containing multiple inputs. Realizing this was when I was developing the model itself, and running into errors like dimension mismatch. Nonetheless I figured it out and now have a dictionary structured like:
my_dict = {
"Scientific Name": ["Danaus plexippus", "Danaus plexippus", ..., "Heliconius erato", ...], # Notice how there are multiple occurrences of the same species this time
"Image Arrays": [[[[...]]], [[[...]]], ..., [[[...]]], ...] # Notice how this is a list of 3-D Image matrices from matplotlib.image.imread resized to be 128x128,
# instead of a list of 4-D matrices.
}I wouldn't need to record the image files anywhere, as it isn't used for input, so I dropped it. This would then create a DataFrame similar to:
| Scientific Name | Image Arrays |
|---|---|
| "Danaus plexippus" | [[[...]]] |
| "Danaus plexippus" | [[[...]]] |
| ... | ... |
| "Heliconius erato" | [[[...]]] |
| ... | ... |
Perfect! Now I can map each input of the image array to a specific species! After handling my data the correct way, I
saved it as .npy files in the data/ directory. This would prevent us from having to create a DataFrame every time we want to predict from an image
as I can just call the file later on and store that to my model.py.
I trained the model using Keras and used multiple convolutional neurons to make a reliable model. After training for a few short
minutes, I would save the model into butterfly_classifier.h5. Saving the model there would allow us to make predictions quicker
for the same reasons above.
In main.py we handle the prediction and other possible errors. This is the part where the user interacts with the program, so you'll
bear witness to multiple checks and balances in the forms of try/except blocks. But if we were to allow users to send in whatever image
they want, then we would likely have shape mismatch errors (as in feeding my model an array of size (1440, 1080, 3) when the input only accepts (128, 128, 3).
To ensure the user won't break everything, I have added the same way to resize the image down.Also in main.py I have added a way to reverse the one-hot
encoding that took place in model.py as if the program just returned an index number for the prediction matrix, it would not be a good program.
Through careful analysis (and by that I meant playing in my Data Processing.ipynb notebook) I figured out how the array was one-hot
encoded and reversed it to provide the correct species matched to the input image. Adding a way to get user input (by the means of sys.argv)
I have completed the rough outline of the project!
What this project was designed for was to mainly test my knowledge about machine learning and whether or not I can make a reliable model based off a dataset. Instead of tackling the MNIST sets, I wanted to do something different but just as similar. Typical MNIST dataset tutorials give you a preprocessed dataset all in one line of code. Where's the fun in that? Where's the data pre-processing? I opted for doing all that by myself and it seemed to work in my favor.
- More pictures of different species (more important)
- More pictures of the same species (lower priority)
- Clone this repo (for help see this tutorial).
- Download the dataset here and unzip the file in the
data/folder of this repo. **DO NOT CHANGE THE FOLDER NAME FROM **leedsbutterfly/ conda create venvconda activate venvconda install -r requirements.txtpython main.py ~/PATH/TO/YOUR/IMAGE
Note: The label_arrays.npy, image_arrays.npy, and butterfly_classifier.h5 files are not provided while cloning the repository.
This means that upon the first run, the script will automatically generate those files, but it will consume time and resources.
To get access to these files and quickly run the model, check out the releases.
If you do decide to download the files and run the classifier, you don't need
to use the dataset, and can safely delete it.
| Argument | Output |
|---|---|
~/PATH/TO/YOUR/IMAGE |
["Scientific Name", "Common Name", "Brief Description"] |
-h --help |
Show a help message and exit |
-r --retrain |
Retrains the model and runs the new model to predict off the given image. The new model gets saved in a file. |
-b --batch_size |
Changes the default batch size from 128 to user input |
-e --epochs |
Changes the default epoch size from 12 to user input |
-t --test_size |
Changes the default test proportion from 0.2 (20%) to user input. |
- Preprocess dataset
- Create the model and save it
- Make a working script to run everything and predict
- Add the ability to retrain the model with new user input
- Data augmentation to improve accuracy of the models
- Make a website that allows you to add images or take images of butterflies and tells you what species they are.
- 1.0.0 (2/9/2020): First working release. More to come soon 😄
- 1.2.0 (2/11/2020): Added command line arguments.
Creator: Anish Govind
Ways to contact:
IF YOU FIND ANY ISSUES OR BUGS PLEASE OPEN AN ISSUE