
ImageNet dataset #764


Merged

merged 15 commits into pytorch:master from pmeier:imagenet_dataset on Mar 19, 2019

Conversation

pmeier
Collaborator

@pmeier pmeier commented Feb 26, 2019

This is my first attempt to implement the ImageNet dataset as discussed in #713. I only used the official files, which can be downloaded here.

Since the training set, the validation set, and the meta information have to be downloaded separately, I see no downside to structuring the dataset directory properly. In anticipation of multiple years of the ILSVRC, I've created the following dataset structure:

ILSVRC
├── 2012
│   ├── meta.bin
│   ├── train
│   │   ├── n012345678
│   │   └── ...
│   └── val
│       ├── n012345678
│       └── ...
└── 2013
    └── ...

The synset identifiers and the corresponding index converter are accessible via the attributes wnids and wnid_to_idx, while classes and class_to_idx now refer to the human-readable class names.
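
For illustration, a minimal usage sketch (the constructor arguments and printed values are assumptions, not necessarily the final API; only the attribute names are as described above):

from torchvision.datasets import ImageNet

# Hypothetical constructor call; the actual signature may differ.
dataset = ImageNet("ILSVRC", split="val")

# Synset identifiers (WordNet IDs) and the mapping from wnid to class index.
print(dataset.wnids[:2])                 # e.g. ['n01440764', 'n01443537']
print(dataset.wnid_to_idx["n01440764"])  # e.g. 0

# Human-readable class names and the mapping from class name(s) to class index.
print(dataset.classes[:2])
print(list(dataset.class_to_idx.items())[:2])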

Major Edits

  • In 616492e the attribute year was removed as requested. Thus, the tree now looks like this:
ILSVRC
├── meta.bin
├── train
│   ├── n012345678
│   └── ...
├── train.tar
├── val
│   ├── n012345678
│   └── ...
└── val.tar

To Do

  • For now only the classification challenge is supported.
  • For now only the ILSVRC2012 is supported. The parsing of the development kit, which contains the meta information, is (probably) not yet applicable to other years.
  • I don't know if the meta information changes between the years. If not, I think it would be preferable to have only one meta file in the ILSVRC folder.
  • The class_to_idx converter is not a good solution in its current state, since one needs the full tuple of all class names of a synset to convert it to an index (see the sketch after this list).
  • In its current state, the meta file also contains the ground-truth data of the validation set. I save it since it is needed to prepare the validation folder; afterwards this information is not needed anymore.
  • For now this is only tested on the validation set, since the download of the training set is still ongoing.
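
A small illustration of the class_to_idx issue mentioned above (the values are made up):

# Hypothetical excerpt of the current class_to_idx mapping: the keys are the
# full tuples of human-readable names of a synset, not individual names.
class_to_idx = {
    ("tench", "Tinca tinca"): 0,
    ("goldfish", "Carassius auratus"): 1,
}

class_to_idx[("tench", "Tinca tinca")]  # works, but one must know every name
# class_to_idx["tench"]                 # KeyError: a single name is not enough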

Let me know what you think.

@codecov-io

codecov-io commented Feb 26, 2019

Codecov Report

Merging #764 into master will decrease coverage by 0.68%.
The diff coverage is 21.8%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #764      +/-   ##
==========================================
- Coverage   38.13%   37.44%   -0.69%     
==========================================
  Files          32       33       +1     
  Lines        3126     3261     +135     
  Branches      487      521      +34     
==========================================
+ Hits         1192     1221      +29     
- Misses       1855     1961     +106     
  Partials       79       79
Impacted Files Coverage Δ
torchvision/datasets/__init__.py 100% <100%> (ø) ⬆️
torchvision/datasets/imagenet.py 21.21% <21.21%> (ø)
torchvision/datasets/fakedata.py 22.85% <0%> (-1.39%) ⬇️
torchvision/transforms/transforms.py 83.41% <0%> (ø) ⬆️
torchvision/models/googlenet.py 15.87% <0%> (ø) ⬆️
torchvision/models/resnet.py 17.29% <0%> (ø) ⬆️
torchvision/models/inception.py 14.41% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d9f48a...1e75006.

@fmassa
Member

fmassa commented Feb 26, 2019

Hi,

One quick question: the ImageNetDetection that you mention is actually for classification, and not really for detection, right? I don't see any mention of the bounding boxes for detection in your implementation.

@pmeier
Collaborator Author

pmeier commented Feb 26, 2019

@fmassa You are right. I misinterpreted the meaning of detection and segmentation.

Member

@fmassa fmassa left a comment

Thanks for the PR!

This looks generally good.

I have some comments, let me know what you think

@fmassa
Member

fmassa commented Mar 9, 2019

I think this generally looks great, thanks!

I've made one more comment, and I have a question for you: what would be a good (or even the best) way to write testing code that verifies the Dataset logic works fine?

Simply downloading the dataset files would be prohibitively expensive for large datasets (such as ImageNet). We could patch the download logic during testing, or have some small test files that the tests use during continuous integration.

I'd love to have your feedback here.

@pmeier
Collaborator Author

pmeier commented Mar 10, 2019

I don't know if this is feasible, but we could package our own fake datasets. They would resemble the structure of the original dataset, but with a drastically reduced number of instances (e.g. only one image per class for ImageNet). These could be downloaded and extracted quickly, and thus a test could be run within continuous integration.

But I don't think it is sufficient to check the download and extraction process this way, even if we use the original dataset: the absence of exceptions during this process is IMO not a good criterion for asserting that the dataset is ready for use afterwards. We should also check some statistics.

We could start off by creating dataset objects of all different combinations (mostly different splits but also years for VOC etc.). For each of these objects we could check the following:

  • Is the number of instances correct?
  • Does the mean (or any other reasonable summary statistic) of all instances equal some pre-computed value?

If these checks pass, we can be sure that the download and extraction works correctly. However, we need to calculate the stored summary statistics without using the implemented procedure to avoid circular reasoning.


P.S.

On second thought, some variant of this could also be used to verify the integrity of an ImageFolder dataset such as ImageNet. While calculating the statistics for all instances is probably too time-consuming, applying this to a small subset (e.g. the first, the median, and the last instance) could already suffice.
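
A rough sketch of such an integrity check (the function name and arguments are made up; the reference statistics would be pre-computed offline, not with this code):

from torchvision import transforms


def check_integrity(dataset, expected_len, expected_means, atol=1e-3):
    # Check that the number of instances matches the expected value.
    assert len(dataset) == expected_len

    # Only probe a few instances, e.g. the first, the median, and the last one,
    # and compare their per-image mean against pre-computed reference values.
    indices = (0, len(dataset) // 2, len(dataset) - 1)
    to_tensor = transforms.ToTensor()
    for idx, expected_mean in zip(indices, expected_means):
        image, _ = dataset[idx]
        mean = to_tensor(image).mean().item()
        assert abs(mean - expected_mean) <= atol, "instance {} looks corrupted".format(idx)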

@fmassa
Member

fmassa commented Mar 11, 2019

@pmeier I think this is in the right direction!

I think we don't need the original dataset images to test the whole pipeline: a set of small, randomly generated images would be sufficient.
For example, if we patch (using mock) download_and_extract_tar (or maybe even tarfile), we can test all the functionality of this dataset without having to download a single file.

While I agree that having integrity tests on the data downloaded would be nice to test, I think that it might actually make things more complicated: I'm not sure we can zip a few images from ImageNet due to licensing issues.

Here is an example of something that I think could be a nice inspiration: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image/imagenet_test.py and https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/testing/dataset_builder_testing.py
They have downloading logic for ImageNet, but they monkey-patch only a few functions so that everything can still be tested.
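
A minimal sketch of the mocking idea (the helper name download_and_extract_tar, the fake-data layout, and the assertions are assumptions for illustration only):

import os
import tempfile
import unittest
from unittest import mock

from PIL import Image


def make_fake_split(root, split, wnids=("n01440764",), images_per_class=1):
    # Lay out a tiny ImageFolder-style split filled with blank 8x8 images.
    for wnid in wnids:
        class_dir = os.path.join(root, split, wnid)
        os.makedirs(class_dir)
        for i in range(images_per_class):
            Image.new("RGB", (8, 8)).save(os.path.join(class_dir, "{}_{}.JPEG".format(wnid, i)))


class ImageNetFakeDataTest(unittest.TestCase):
    # download_and_extract_tar is a hypothetical helper; patching it means no
    # real archive is ever downloaded during the test.
    @mock.patch("torchvision.datasets.imagenet.download_and_extract_tar", create=True)
    def test_val_split_without_download(self, mocked_download):
        with tempfile.TemporaryDirectory() as root:
            mocked_download.side_effect = lambda *args, **kwargs: make_fake_split(root, "val")

            # A real test would now construct datasets.ImageNet(root, split="val")
            # and check len(dataset), dataset.wnids, dataset.classes, etc.
            mocked_download("http://example.com/ILSVRC2012_img_val.tar", root)
            self.assertTrue(os.path.isdir(os.path.join(root, "val", "n01440764")))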

def _empty_split_folder(self):
    try:
        shutil.rmtree(self.split_folder)
    except FileNotFoundError:
Member

This is Python3 only unfortunately.

Collaborator Author

@pmeier pmeier Mar 11, 2019

I've put in

if sys.version_info[0] == 2:
    # FIXME: I don't know if this is good practice / robust
    FileNotFoundError = OSError

on top of the module. This worked for me in a quick test, but the FIXME note should be taken literally.

Removed in b0bc90e, since we no longer empty the split_folder. I would still appreciate some feedback on this in order to avoid similar situations in the future.
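
For the record, one alternative that works on both Python 2 and 3 without rebinding the builtin name is to catch OSError and filter on errno (a sketch, not what the PR ended up doing):

import errno
import shutil


def _empty_split_folder(split_folder):
    try:
        shutil.rmtree(split_folder)
    except OSError as e:
        # On Python 3, FileNotFoundError is a subclass of OSError; on Python 2,
        # shutil raises a plain OSError. Checking errno ignores only "not found".
        if e.errno != errno.ENOENT:
            raise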

@pmeier
Collaborator Author

pmeier commented Mar 11, 2019

Hm, you are right. I didn't think of the licensing issues. I've never worked with mock, so it will take me some time to review the examples and come up with something.

On a side note: should we move this discussion to another issue, since it applies to all datasets rather than only ImageNet?

@fmassa
Member

fmassa commented Mar 11, 2019

@pmeier yes, the testing is something that should be done for all datasets.

I'll start working on it right now. I believe this is a pretty high-priority feature, as it is very difficult to review dataset code and ensure that it works as expected for both Python 2 and Python 3.

I'll try to push a proof-of-concept implementation for the tests today, and I'll tag you there.

@pmeier
Collaborator Author

pmeier commented Mar 18, 2019

@fmassa Any progress?

@fmassa
Member

fmassa commented Mar 19, 2019

@pmeier I haven't had the time yet to finish the tests for the datasets: this involves refactoring the downloading abstractions, and I got stuck with other things. I'll add tests for this PR after merging it.

Member

@fmassa fmassa left a comment

Thanks!

I'll be adding tests for this class in a follow-up PR

@fmassa fmassa merged commit 6938291 into pytorch:master Mar 19, 2019
@pmeier pmeier deleted the imagenet_dataset branch April 10, 2019 06:12