
ImageNet dataset #764


Merged

merged 15 commits into pytorch:master from pmeier:imagenet_dataset on Mar 19, 2019

Conversation

pmeier
Collaborator

@pmeier pmeier commented Feb 26, 2019

This is my first attempt to implement the ImageNet dataset as discussed in #713. I only used the official files, which can be downloaded here.

Since the training set, the validation set, and the meta information have to be downloaded separately, I see no downside to structuring the dataset directory properly. In anticipation of multiple years of the ILSVRC, I've created the following dataset structure:

ILSVRC
├── 2012
│   ├── meta.bin
│   ├── train
│   │   ├── n012345678
│   │   └── ...
│   └── val
│       ├── n012345678
│       └── ...
└── 2013
    └── ...

The synset identifiers and the corresponding index converter are accessible via the attributes wnids and wnid_to_idx, while classes and class_to_idx now refer to the human-readable class names.
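
For illustration, a minimal usage sketch (the constructor arguments and printed values are assumptions, not necessarily the final API; only the attribute names are as described above):

from torchvision.datasets import ImageNet

# Hypothetical constructor call; the actual signature may differ.
dataset = ImageNet("ILSVRC", split="val")

# Synset identifiers (WordNet IDs) and the mapping from wnid to class index.
print(dataset.wnids[:2])                 # e.g. ['n01440764', 'n01443537']
print(dataset.wnid_to_idx["n01440764"])  # e.g. 0

# Human-readable class names and the mapping from class name(s) to class index.
print(dataset.classes[:2])
print(list(dataset.class_to_idx.items())[:2])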

Major Edits

  • In 616492e the attribute year was removed as requested. Thus, the tree now looks like this:
ILSVRC
├── meta.bin
├── train
│   ├── n012345678
│   └── ...
├── train.tar
├── val
│   ├── n012345678
│   └── ...
└── val.tar

To Do

  • For now only the classification challenge is supported.
  • For now only the ILSVRC2012 is supported. The parsing of the development kit, which contains the meta information, is (probably) not yet applicable to other years.
  • I don't know if the meta information changes between the years. If not, I think it would be preferable to have only one meta file in the ILSVRC folder.
  • The class_to_idx converter is not a good solution in its current state, since one needs the full tuple of all class names of a synset to convert it to an index (see the sketch after this list).
  • In its current state, the meta file also contains the ground-truth data of the validation set. I save it since it is needed to prepare the validation folder; afterwards this information is not needed anymore.
  • For now this is only tested on the validation set, since the download of the training set is still ongoing.
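
A small illustration of the class_to_idx issue mentioned above (the values are made up):

# Hypothetical excerpt of the current class_to_idx mapping: the keys are the
# full tuples of human-readable names of a synset, not individual names.
class_to_idx = {
    ("tench", "Tinca tinca"): 0,
    ("goldfish", "Carassius auratus"): 1,
}

class_to_idx[("tench", "Tinca tinca")]  # works, but one must know every name
# class_to_idx["tench"]                 # KeyError: a single name is not enough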

Let me know what you think.

@codecov-io

codecov-io commented Feb 26, 2019

Codecov Report

Merging #764 into master will decrease coverage by 0.68%.
The diff coverage is 21.8%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #764      +/-   ##
==========================================
- Coverage   38.13%   37.44%   -0.69%     
==========================================
  Files          32       33       +1     
  Lines        3126     3261     +135     
  Branches      487      521      +34     
==========================================
+ Hits         1192     1221      +29     
- Misses       1855     1961     +106     
  Partials       79       79
Impacted Files Coverage Δ
torchvision/datasets/__init__.py 100% <100%> (ø) ⬆️
torchvision/datasets/imagenet.py 21.21% <21.21%> (ø)
torchvision/datasets/fakedata.py 22.85% <0%> (-1.39%) ⬇️
torchvision/transforms/transforms.py 83.41% <0%> (ø) ⬆️
torchvision/models/googlenet.py 15.87% <0%> (ø) ⬆️
torchvision/models/resnet.py 17.29% <0%> (ø) ⬆️
torchvision/models/inception.py 14.41% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d9f48a...1e75006.

@fmassa
Member

fmassa commented Feb 26, 2019

Hi,

One quick question: the ImageNetDetection that you mention is actually for classification, and not really for detection, right? I don't see any mention of the bounding boxes for detection in your implementation.

@pmeier
Collaborator Author

pmeier commented Feb 26, 2019

@fmassa You are right. I misinterpreted the meaning of detection and segmentation.

Member

@fmassa fmassa left a comment

Thanks for the PR!

This looks generally good.

I have some comments, let me know what you think

@fmassa
Member

fmassa commented Mar 9, 2019

I think this generally looks great, thanks!

I've made one more comment, and I have a question for you: what would be a good (or even the best) way to write testing code that verifies the Dataset logic works fine?

Simply downloading the dataset files would be prohibitively expensive for large datasets (such as ImageNet). We could patch the download logic during testing, or have some small test files that the tests use during continuous integration.

I'd love to have your feedback here.

@pmeier
Collaborator Author

pmeier commented Mar 10, 2019

I don't know if this is feasible, but we could package our own fake datasets. They would resemble the structure of the original dataset, but with a drastically reduced number of instances (e.g. only one image per class for ImageNet). These could be downloaded and extracted quickly, and thus a test could be run within continuous integration.

But I don't think it is sufficient to check the download and extraction process this way, even if we use the original dataset: the absence of exceptions during this process is IMO not a good criterion for asserting that the dataset is ready for use afterwards. We should also check some statistics.

We could start off by creating dataset objects of all different combinations (mostly different splits but also years for VOC etc.). For each of these objects we could check the following:

  • Is the number of instances correct?
  • Does the mean (or any other reasonable summary statistic) of all instances equal some pre-computed value?

If these checks pass, we can be sure that the download and extraction works correctly. However, we need to calculate the stored summary statistics without using the implemented procedure to avoid circular reasoning.


P.S.

On second thought, some variant of this could also be used to verify the integrity of an ImageFolder dataset such as ImageNet. While calculating the statistics for all instances is probably too time-consuming, applying this to a small subset (e.g. the first, the median, and the last instance) could already suffice.
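
A rough sketch of such an integrity check (the function name and arguments are made up; the reference statistics would be pre-computed offline, not with this code):

from torchvision import transforms


def check_integrity(dataset, expected_len, expected_means, atol=1e-3):
    # Check that the number of instances matches the expected value.
    assert len(dataset) == expected_len

    # Only probe a few instances, e.g. the first, the median, and the last one,
    # and compare their per-image mean against pre-computed reference values.
    indices = (0, len(dataset) // 2, len(dataset) - 1)
    to_tensor = transforms.ToTensor()
    for idx, expected_mean in zip(indices, expected_means):
        image, _ = dataset[idx]
        mean = to_tensor(image).mean().item()
        assert abs(mean - expected_mean) <= atol, "instance {} looks corrupted".format(idx)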

@fmassa
Member

fmassa commented Mar 11, 2019

@pmeier I think this is in the right direction!

I think we don't need the original dataset images to test the whole pipeline: a set of small, randomly generated images would be sufficient.
For example, if we patch (using mock) download_and_extract_tar (or maybe even tarfile), we can test all the functionality of this dataset without having to download a single file.

While I agree that having integrity tests on the data downloaded would be nice to test, I think that it might actually make things more complicated: I'm not sure we can zip a few images from ImageNet due to licensing issues.

Here is an example of something that I think could be a nice inspiration: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image/imagenet_test.py and https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/testing/dataset_builder_testing.py
They have downloading logic for ImageNet, but they monkey-patch only a few functions so that everything can still be tested.
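
A minimal sketch of the mocking idea (the helper name download_and_extract_tar, the fake-data layout, and the assertions are assumptions for illustration only):

import os
import tempfile
import unittest
from unittest import mock

from PIL import Image


def make_fake_split(root, split, wnids=("n01440764",), images_per_class=1):
    # Lay out a tiny ImageFolder-style split filled with blank 8x8 images.
    for wnid in wnids:
        class_dir = os.path.join(root, split, wnid)
        os.makedirs(class_dir)
        for i in range(images_per_class):
            Image.new("RGB", (8, 8)).save(os.path.join(class_dir, "{}_{}.JPEG".format(wnid, i)))


class ImageNetFakeDataTest(unittest.TestCase):
    # download_and_extract_tar is a hypothetical helper; patching it means no
    # real archive is ever downloaded during the test.
    @mock.patch("torchvision.datasets.imagenet.download_and_extract_tar", create=True)
    def test_val_split_without_download(self, mocked_download):
        with tempfile.TemporaryDirectory() as root:
            mocked_download.side_effect = lambda *args, **kwargs: make_fake_split(root, "val")

            # A real test would now construct datasets.ImageNet(root, split="val")
            # and check len(dataset), dataset.wnids, dataset.classes, etc.
            mocked_download("http://example.com/ILSVRC2012_img_val.tar", root)
            self.assertTrue(os.path.isdir(os.path.join(root, "val", "n01440764")))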

def _empty_split_folder(self):
    try:
        shutil.rmtree(self.split_folder)
    except FileNotFoundError:
Member

This is Python3 only unfortunately.

Collaborator Author

@pmeier pmeier Mar 11, 2019

I've put in

if sys.version_info[0] == 2:
    # FIXME: I don't know if this is good practice / robust
    FileNotFoundError = OSError

on top of the module. This worked for me in a quick test, but the FIXME note should be taken literally.

Removed in b0bc90e, since we no longer empty the split_folder. I would still appreciate some feedback on this in order to avoid similar situations in the future.
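
For the record, one alternative that works on both Python 2 and 3 without rebinding the builtin name is to catch OSError and filter on errno (a sketch, not what the PR ended up doing):

import errno
import shutil


def _empty_split_folder(split_folder):
    try:
        shutil.rmtree(split_folder)
    except OSError as e:
        # On Python 3, FileNotFoundError is a subclass of OSError; on Python 2,
        # shutil raises a plain OSError. Checking errno ignores only "not found".
        if e.errno != errno.ENOENT:
            raise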

@pmeier
Collaborator Author

pmeier commented Mar 11, 2019

Hm, you are right. I didn't think of the licensing issues. I've never worked with mock, so it will take me some time to review the examples and come up with something.

On a side note: should we move this discussion to another issue, since it applies to all datasets rather than only ImageNet?

@fmassa
Member

fmassa commented Mar 11, 2019

@pmeier yes, the testing is something that should be done for all datasets.

I'll start working on it right now. I believe this is a pretty high-priority feature, as it is very difficult to review dataset code and ensure that it works as expected for both Python 2 and Python 3.

I'll try to push a proof-of-concept implementation for the tests today, and I'll tag you there.

@pmeier
Collaborator Author

pmeier commented Mar 18, 2019

@fmassa Any progress?

@fmassa
Member

fmassa commented Mar 19, 2019

@pmeier I haven't had the time yet to finish the tests for the datasets: this involves refactoring the downloading abstractions, and I got stuck with other things. I'll add tests for this PR after merging it.

Member

@fmassa fmassa left a comment

Thanks!

I'll be adding tests for this class in a follow-up PR

@fmassa fmassa merged commit 6938291 into pytorch:master Mar 19, 2019
@pmeier pmeier deleted the imagenet_dataset branch April 10, 2019 06:12