
imgnet-tools

Tools to download and split images from image-net.org

Easy image-net downloader and splitter

What is it?

Image-net.org provides millions of images classified by type and name. It is a great data source for machine learning and image recognition. This repository provides tools to download and split data from the image-net website.

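This repository contains 2 scripts: dl-imgnet.py, which downloads images, and splitter.py, which splits them into training and validation sets.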

Note that the dl-imgnet script makes some generic tests:

That’s mainly why I created this script: to be able to download images and avoid bad files.

Any help, ideas, fixes, and so on are gratefully appreciated :)

Requirements

The scripts need the “requests” package. It is recommended to install the python3 package with your package manager:

# Fedora, CentOS, Red Hat Like...
sudo dnf install python3-requests

# Debian like, Ubuntu...
sudo apt install python3-requests

If you cannot install the required package with your package manager, I provide a minimal requirements file, so just type:

pip3 install -r requirements.txt

Usage

dl-imgnet

To use this script, go to http://image-net.org and search for a term, for example “pizza”. The website will show several image sets; choose one.

On the result page, the URL contains a “wnid”. For example: “http://image-net.org/synset?wnid=n07873807”

Copy that ID, and use the command line:

./dl-imgnet.py n07873807 pizzas
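
If you collect result URLs in a script, the ID can be pulled out of the query string instead of copied by hand. The following is a minimal sketch using only the Python 3 standard library; the helper name wnid_from_url is hypothetical and is not part of dl-imgnet.py.

```python
from urllib.parse import urlparse, parse_qs


def wnid_from_url(url):
    """Return the value of the 'wnid' query parameter, or None if absent.

    Hypothetical helper for illustration; not part of dl-imgnet.py.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("wnid")
    return values[0] if values else None


print(wnid_from_url("http://image-net.org/synset?wnid=n07873807"))
```

The returned string can then be passed as the first argument of dl-imgnet.py.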

The script is also able to make a search on terms:

./dl-imgnet.py pizza

In this case, a search is made on the given term, and a list of IDs with descriptions and links is displayed - but no download is made. You can visit the provided links and choose IDs.

You may use a list of IDs if you want to merge several classes (e.g. French Fries on the imagenet site is split into 2 classes):

./dl-imgnet.py n07711080,n07711232 fries

Note that the script creates a CSV file recording the downloaded image URLs, the md5 sum of each file, the class name and the “id” of the imagenet class. This is useful to avoid downloading files that were already downloaded and to keep track of the source of the images. This file is named data.csv by default; the name can be changed with the -c argument. The data.csv file provides information:

Note also that the script saves images with the name template <sysnet ID>.<type>, where the type is determined by the imghdr.what() function (Python 3 standard library). That way, several URLs can have the same image name but different content.

The script creates a pizzas directory and downloads all valid images into it. You can change the destination:
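
The CSV bookkeeping described above can be sketched as follows. This is only an illustration of the idea, not the code of dl-imgnet.py: the column order (url, md5, class name, id) is an assumption, and the function names load_seen, should_download and is_new_content are hypothetical.

```python
import csv
import hashlib
import io


def load_seen(csv_file):
    """Collect known URLs and md5 sums from an open CSV file.

    Assumed (hypothetical) column order: url, md5, classname, wnid.
    """
    seen_urls, seen_sums = set(), set()
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip empty or malformed rows
        seen_urls.add(row[0])
        seen_sums.add(row[1])
    return seen_urls, seen_sums


def should_download(url, seen_urls):
    """A URL already listed in the CSV does not need to be fetched again."""
    return url not in seen_urls


def is_new_content(data, seen_sums):
    """Downloaded bytes are new if their md5 sum is not in the CSV yet."""
    return hashlib.md5(data).hexdigest() not in seen_sums


# In-memory stand-in for data.csv with one made-up row.
payload = b"fake image bytes"
existing = io.StringIO(
    "http://example.com/a.jpg,%s,pizzas,n07873807\n"
    % hashlib.md5(payload).hexdigest()
)
urls, sums = load_seen(existing)
print(should_download("http://example.com/a.jpg", urls))  # already recorded
print(should_download("http://example.com/b.jpg", urls))  # not recorded yet
print(is_new_content(payload, sums))                      # same bytes as row 1
```

Checking the md5 sum as well as the URL catches the case where two different URLs serve the same file.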

./dl-imgnet.py n07873807 pizzas -d base

This will create a base/pizzas directory and use it to download images.

You may provide other options:

splitter

Sometimes you need to split training and validation images into separate directories. The splitter.py script will help you.

./splitter.py rep/to/pizzas 

It will create “valid” and “train” directories and copy random images from the rep/to/pizzas directory into both directories. The default fraction of images sent to “valid” is “.2” (20%).

train.csv and valid.csv are also written in the destination directory. If you launch the splitter again, these files are updated, appending the other classes/files. You’ll need to delete the CSV files if you want a clean new list. The --all option behaves differently: it removes the CSV files before recreating them.
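
The random split described above can be sketched in a few lines. This is a minimal illustration and not the code of splitter.py: the function name split_files, the seed parameter and the rounding rule (truncating toward zero) are all assumptions.

```python
import random


def split_files(names, fraction=0.2, seed=None):
    """Shuffle file names and return (train, valid) lists.

    'fraction' is the share of files sent to the validation list.
    Hypothetical helper; the rounding rule is an assumption.
    """
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


files = ["img%d.jpeg" % i for i in range(10)]
train, valid = split_files(files, fraction=0.2, seed=0)
print(len(train), len(valid))  # 8 2
```

Every file lands in exactly one of the two lists, so the two directories never overlap.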

You can tell the splitter to split all the classes found in a “base” directory:

./splitter.py rep/without/class -a

This will find the entire class list and split each class into the “train” and “valid” directories. Note that in this case, the CSV files are deleted before being rewritten!

If you do not want to copy images and only want the CSV files (train.csv and valid.csv), use the -C or --csv-only option.

Once again, you can change the destination and/or the split fraction:

./splitter.py base/pizzas -d data -f .3

This time, it splits pizza images with 30% for validation, and images are copied to data/valid/pizzas and data/train/pizzas.

Of course, you can split several “base” directories; it will not remove the other directories.
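
The resulting layout can be expressed as a small path computation. The sketch below only illustrates how the destination, the subset (“train” or “valid”) and the class name combine; target_path is a hypothetical helper and the file name is made up.

```python
from pathlib import Path


def target_path(dest, subset, source_dir, filename):
    """Build <dest>/<subset>/<class>/<filename>.

    The class name is taken from the last component of the source
    directory. Hypothetical helper, not part of splitter.py.
    """
    class_name = Path(source_dir).name
    return Path(dest) / subset / class_name / filename


print(target_path("data", "valid", "base/pizzas", "example.jpeg").as_posix())
print(target_path("data", "train", "base/pizzas", "example.jpeg").as_posix())
```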

You can change the options: