
Visualizing butterflies from the Natural History Museum

This repository contains code to process images of butterflies from the data portal of the Natural History Museum in London. From the processed images, a web-based, interactive, hierarchical t-SNE plot is created. The code in this repository can be used to reproduce the visualization or to extract the images and use them for something else. Pre-trained models for all three neural networks are included.

Click here for the interactive visualization.

Click here for my blog post that explains the data preparation procedure.

Usage

This section explains how to recreate the visualization on your machine. You need a GPU with at least 8 GB of VRAM and ~600 GB of disk space. The full dataset contains 716,000 images. You can use less disk space if you use a subset of the dataset, as explained below.

  1. On the NHM data portal, search for "lepidoptera". At this point, you can narrow the search if you want to work with a smaller dataset. Click the Download button and request an email with the CSV files. The CSV files will be ~1.3 GB for the full dataset.

  2. Clone this repository and unpack the files multimedia.csv and occurrence.csv from the data portal into the data directory.

  3. Run create_metadata.py. This will create the file metadata.csv in the data directory. The resulting CSV file contains a line for each image that will be used. You can modify the Python script or the resulting CSV file if you want to work with a subset of the dataset (a filtering sketch is shown after this list).

  4. Run download.py. This script downloads the original images into the data/raw directory. For the full dataset, this takes ~2 weeks and requires 452 GB. The download speed is limited by the NHM servers, which serve around 1 file per second. You can stop the script and it will resume where it left off (see the downloader sketch after this list).

  5. Optional: Train the classifier U-Net. TODO

  6. Run create_images.py. This step removes the backgrounds using the U-Net, which classifies every pixel as background or foreground, and creates images with an alpha channel. For each original image, it writes a square PNG of just the butterfly, in varying sizes, to the data/images_alpha directory. The script takes ~26 hours for the full dataset and uses ~150 GB of disk space. You can stop and resume this script (see the masking sketch after this list).

  7. Optional: Train the rotation network. TODO

  8. Run scale_images.py. This creates a 128x128 JPG image with a white background for each of the PNG images and stores it in the data/images_rotated_128 directory. It also uses the rotation network to bring the butterflies into the default orientation; the calculated rotations are saved to data/rotations_calculated.csv. You can stop and resume this script (see the compositing sketch after this list).

  9. Optional: Train the autoencoder. Run train_autoencoder.py. This runs indefinitely, until you stop it. The longer it trains, the better the result. You can run train_autoencoder.py continue to resume training on the previously trained model. You can run test_autoencoder.py to create example pairs of input and reconstructed images in the data/test directory. Stop the test script after some images have been created.

  10. Run create_latent_codes.py. This calculates latent codes for all images in the dataset.

  11. Run create_tsne.py. This calculates the t-SNE embedding of the latent codes (see the sketch after this list).

  12. Run move_points.py. This moves apart points that would otherwise overlap in the visualization (see the sketch after this list).

  13. Run create_tiles.py. This creates the Leaflet map tiles for the visualization (see the tiling sketch after this list).

  14. Run create_json.py. This creates JSON files for the metadata that will be displayed in the web app.

  15. The files for the web app are in the server directory. You can test the web app by going into the server directory, running python3 -m http.server, and opening the server address (e.g. http://0.0.0.0:8000/) in a browser.
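
The sketches below illustrate individual steps. They are not part of the repository, and any file names, column names, or parameters not mentioned in the steps above are assumptions. For step 3, this is one way to shrink metadata.csv to a random subset:

```python
# Hypothetical helper, not part of the repository: keep a random
# subset of the rows in data/metadata.csv so the rest of the
# pipeline runs on a smaller dataset.
import csv
import random

SUBSET_SIZE = 10_000  # number of images to keep

with open("data/metadata.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

random.seed(0)  # fixed seed so the subset is reproducible
subset = random.sample(rows, min(SUBSET_SIZE, len(rows)))

with open("data/metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(subset)
```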
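
For step 4, the resume behavior can be as simple as skipping files that already exist. The column names "id" and "url" are assumptions; download.py may differ:

```python
# Minimal resumable downloader in the spirit of download.py.
import csv
import os
import urllib.request

os.makedirs("data/raw", exist_ok=True)

with open("data/metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # "id" and "url" are assumed column names.
        filename = os.path.join("data/raw", row["id"] + ".jpg")
        if os.path.exists(filename):  # resume: skip finished downloads
            continue
        try:
            urllib.request.urlretrieve(row["url"], filename)
        except OSError as error:
            print(f"Failed to download {row['url']}: {error}")
```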
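
For step 6, the U-Net itself is omitted here; this sketch only shows how a predicted foreground mask becomes a square RGBA cutout. The placeholder mask and the file names are assumptions:

```python
# Turn a foreground mask into a square RGBA PNG of just the butterfly.
import os

import numpy as np
from PIL import Image

rgb = np.asarray(Image.open("data/raw/example.jpg").convert("RGB"))

# Placeholder: the real mask comes from the U-Net, one probability
# per pixel of being foreground.
mask = np.ones(rgb.shape[:2], dtype=np.float32)

alpha = (mask * 255).astype(np.uint8)
rgba = np.dstack([rgb, alpha])

# Crop to the bounding box of the foreground, then pad to a square.
ys, xs = np.nonzero(alpha)
rgba = rgba[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
height, width = rgba.shape[:2]
size = max(height, width)
square = np.zeros((size, size, 4), dtype=np.uint8)
top, left = (size - height) // 2, (size - width) // 2
square[top:top + height, left:left + width] = rgba

os.makedirs("data/images_alpha", exist_ok=True)
Image.fromarray(square, mode="RGBA").save("data/images_alpha/example.png")
```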
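
For step 8, a sketch of rotating a cutout, compositing it onto white, and scaling it to 128x128. The angle is a placeholder for the rotation network's prediction:

```python
# Composite an RGBA cutout onto white, rotate it into the default
# orientation, and scale it to 128x128.
import os

from PIL import Image

angle = 0.0  # degrees; predicted by the rotation network in the real script

rgba = Image.open("data/images_alpha/example.png").convert("RGBA")
rgba = rgba.rotate(angle, expand=True)  # expand keeps corners visible

white = Image.new("RGBA", rgba.size, (255, 255, 255, 255))
flattened = Image.alpha_composite(white, rgba).convert("RGB")

os.makedirs("data/images_rotated_128", exist_ok=True)
flattened.resize((128, 128), Image.LANCZOS).save(
    "data/images_rotated_128/example.jpg", quality=90)
```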
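
For step 11, create_tsne.py may use a different t-SNE implementation and parameters; this sketch uses scikit-learn, and both file names are assumptions:

```python
# Embed the latent codes in 2D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

latent_codes = np.load("data/latent_codes.npy")  # hypothetical file name
embedding = TSNE(n_components=2).fit_transform(latent_codes)
np.save("data/tsne.npy", embedding)  # hypothetical file name
```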
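
For step 12, move_points.py may use a more refined algorithm; this naive sketch just pushes apart any pair of points closer than a minimum distance until no overlaps remain:

```python
# Naive overlap removal: O(n^2) per pass, fine as an illustration.
import numpy as np

def spread_points(points, min_dist=0.01, iterations=50):
    points = np.array(points, dtype=np.float64)  # work on a copy
    for _ in range(iterations):
        moved = False
        for i in range(len(points)):
            delta = points - points[i]
            dist = np.linalg.norm(delta, axis=1)
            for j in np.nonzero((dist > 0) & (dist < min_dist))[0]:
                # Push both points apart along their connecting line.
                direction = delta[j] / dist[j]
                shift = direction * (min_dist - dist[j]) / 2
                points[j] += shift
                points[i] -= shift
                moved = True
        if not moved:  # stop early once nothing overlaps
            break
    return points

spread = spread_points(np.load("data/tsne.npy"))  # hypothetical file name
```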
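
For step 13, Leaflet expects tiles named {z}/{x}/{y}.png. This sketch cuts one pre-rendered zoom level into 256x256 tiles; create_tiles.py handles multiple zoom levels, and the file paths here are assumptions:

```python
# Slice one rendered zoom level into 256x256 Leaflet tiles.
import os

from PIL import Image

Image.MAX_IMAGE_PIXELS = None  # the rendered map can be very large
TILE = 256
zoom = 0  # hypothetical zoom level

atlas = Image.open("data/map.png")  # hypothetical rendered map image
for x in range(atlas.width // TILE):
    os.makedirs(f"server/tiles/{zoom}/{x}", exist_ok=True)
    for y in range(atlas.height // TILE):
        box = (x * TILE, y * TILE, (x + 1) * TILE, (y + 1) * TILE)
        atlas.crop(box).save(f"server/tiles/{zoom}/{x}/{y}.png")
```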

License

The images of the butterflies are provided by the Trustees of the Natural History Museum under a CC BY 4.0 license.

The code in this repository is provided under the MIT license.
