Easy image-net downloader and splitter
What is it ?
Image-net.org provides millions of images that are classified by type, names… This is a fabulous data source to be able to make machine learning and image recognition. This repositoy provides tools to download and split data coming from image-net website.
This repository contains 2 scripts:
dl-imgnet.pyto download a image-net.org image set using the ID found in the search enginesplitter.pyto split images in train and valid directories
Note that the dl-imgnet script makes some generic tests:
- is the image a real “image” ?
- is it a known “bad image” (for example, sometimes, flickr returns a “logo” to say that image is not found, with a 200 status code)
That’s mainly why I created that script: to be able to download images and avoid bad files
Any help, ideas, fixes, and so on… are gracefully appreciated :)
Requirements
- Python 3
- Pip 3 (to install requirements if your package manager doesn’t provide
python3-requests) - Requests package
- Internt connection…
As you need “requests” package, it’s recommended to use you package manager to install the python3 package:
# Fedora, CentOS, Red Hat Like...
sudo dnf install python3-requests
# Debian like, Ubuntu...
sudo apt install python3-requests
If you cannot install the required pacakge with your package manager, I provide a minimal requirements file, so just type:
pip3 install -r requirements.txt
Usage
dl-imgnet
To use that script, go to http://image-net.org, then make a search request. For example “pizza”. The website will show you several image set, you may now choose one.
Going on the result page, you now see a “wnid” in the URL. For example: “http://image-net.org/synset?wnid=n07873807”
Copy that ID, and use the command line:
./dl-imgnet.py n07873807 pizzas
The script is also able to make a search on terms:
./dl-imgnet.py pizza
In this case, a search is made on the given term, and a list of IDs with description and link is displayed - but no download is made. You can visit provided links and choose IDs.
You may use a list of IDs, if you want to merge several classification (eg. French Fries on imagenet site is splitted in 2 classes):
./dl-imgnet.py n07711080,n07711232 fries
Note the script will create a CSV file keeping downloaded image urls, md5 sum of the file, the classname and the “id” of imagenet class. This is usefull to not download already downloeded files and to keep the source of images. This file is named data.csv and can be changed with -c argument.
The data.csv file provides information:
- Sysnet ID
- URL of the image
- classname
- ID of collection from imagenet
Note also, the script save image with name template <sysnet ID>.<type>, the type is defined by imghdr.what() function (python3 standard). That way, it’s possible to have several URL having the same image name, but not the same content.
It will create a pizzas directory and download all valid images inside. You can change the destination:
./dl-imgnet.py n07873807 pizzas -d base
This will create a base/pizzas directory and use it to downoad images.
You may provide others options:
-dor--destto indicate the destination directory where to download images (don’t provide the image class, the script concatanates the classname to the destination) - default is the current directory-tor--timeoutto indicate how many second to set on request timeout-nor--num-workerto indicate how many parallel download to use. Default is set to your CPU number.-cor--csvpath to the CSV file where to keep downloaded images information
splitter
Sometimes, you need to split train and validation images in separated directories. Splitter.py file will help you.
./splitter.py rep/to/pizzas
It will create “valid” and “train” directories and copy random images from the rep/to/pizzas directory in both directories. The default fraction of image to send to “valid” is “.2” (20%).
train.csv and valid.csv are also written in the destination directory. If you launch again the splitter, so that files are updated to append others classes/files. You’ll need to delete the CSV files if you want a new clean list. This is not the same behaviour with the --all option that removes CSV files before to recreate them.
You can tell splitter to split the entire classes from a “base” directory:
./splitter.py rep/without/class -a
This will find the entire class list and split them in “train” and “valid” directories. Note that in that case, the CSV files are deleted before to be rewritten !
You may want to not copy images and only want CSV files (train.csv and valid.csv), so you can use -C or --csv-only option.
One more time, you can change destination and/or fraction to split:
./splitter.py base/pizzas -d data -f .3
This time, it split pizza images with 30% for validation, and images are copied in data/valid/pizzas and data/train/pizzas.
Of course, you can split several “base”, it will not remove the other directories.
You can change the options:
-aor--allto split the whole class list found in SRC directory.-Cor--csv-onlywill not copy images in destination, only create the CSV files containing filename and classname.-dor--destto indicate the destination root directory where will reside “train” and “valid” directory (eg. “data”, will create “data/train/pizza” for example)-for--fracto a number between .1 and .9 (note that you should use recommended values: .2 or .3)-cor--classnameto indicate another classname that the source directory name.