ImageNet dataset #764
Conversation
Codecov Report
@@ Coverage Diff @@
## master #764 +/- ##
==========================================
- Coverage 38.13% 37.44% -0.69%
==========================================
Files 32 33 +1
Lines 3126 3261 +135
Branches 487 521 +34
==========================================
+ Hits 1192 1221 +29
- Misses 1855 1961 +106
Partials 79 79
Continue to review full report at Codecov.
Hi, one quick question: the
@fmassa You are right. I misinterpreted the meaning of detection and segmentation.
Thanks for the PR!
This looks generally good.
I have some comments; let me know what you think.
I think this generally looks great, thanks! I've made one more comment, and I have a question for you: what would be a good (or even the best) way to have testing code that verifies the Dataset logic works fine? Simply downloading the dataset files would be prohibitively expensive for large datasets (such as ImageNet). We could patch the download logic during testing, or have some small test files that would run during continuous integration. I'd love to have your feedback here.
I don't know if this is possible and feasible, but we could package our own fake datasets. They should resemble the structure of the original dataset, but with a drastically reduced number of instances (e.g. only one image per class).

But I don't think it is sufficient to check the download and extraction process this way, even if we use the original dataset: no exception occurring during this process is IMO not a good metric to assert that the dataset is ready for usage afterwards. We should also check some statistics. We could start off by creating dataset objects of all different combinations (mostly different splits) and comparing them against stored summary statistics.

If these checks pass, we can be sure that the download and extraction work correctly. However, we need to calculate the stored summary statistics without using the implemented procedure, to avoid circular reasoning.

P.S. On second thought, some variant of this could also be used to verify the integrity of an already downloaded dataset.
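For illustration, a minimal sketch of such a statistics check, assuming the new ImageNet class takes root and split arguments and exposes classes; the helper name and the way the expected values are stored are made up here, and the counts are the usual ILSVRC2012 sizes:

import torchvision.datasets as datasets

# Expected ILSVRC2012 sizes, determined independently of the dataset class
# itself to avoid the circular reasoning mentioned above.
EXPECTED = {
    "train": {"num_classes": 1000, "num_images": 1281167},
    "val": {"num_classes": 1000, "num_images": 50000},
}

def check_split(root, split):
    # Assumes the archives were already downloaded and extracted into root.
    dataset = datasets.ImageNet(root, split=split)
    assert len(dataset.classes) == EXPECTED[split]["num_classes"]
    assert len(dataset) == EXPECTED[split]["num_images"]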
@pmeier I think this is in the right direction! I think we don't need to have original dataset images to test the whole pipeline: only a set of randomly generated images of small size would be necessary. While I agree that having integrity tests on the downloaded data would be nice, I think that it might actually make things more complicated: I'm not sure we can zip a few images from ImageNet due to licensing issues. Here is an example of something that I think could be a nice inspiration: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image/imagenet_test.py and https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/testing/dataset_builder_testing.py
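For reference, a rough sketch of the fake-data idea with randomly generated images; the make_fake_image_folder helper, the ImageFolder-style layout, and the example wnids are illustrative assumptions rather than the layout produced by this PR:

import os
import tempfile

import numpy as np
from PIL import Image

def make_fake_image_folder(root, class_names, images_per_class=1, size=(32, 32)):
    # Build a tiny ImageFolder-style tree: root/<class_name>/<index>.jpg
    for name in class_names:
        class_dir = os.path.join(root, name)
        os.makedirs(class_dir)
        for idx in range(images_per_class):
            pixels = np.random.randint(0, 256, size + (3,), dtype=np.uint8)
            Image.fromarray(pixels).save(os.path.join(class_dir, "{}.jpg".format(idx)))

# In a test, the dataset class would be pointed at this tree (or the download
# step patched to produce it) instead of fetching the real data.
root = tempfile.mkdtemp()
make_fake_image_folder(root, ["n01440764", "n01443537"])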
torchvision/datasets/imagenet.py
Outdated
def _empty_split_folder(self):
    try:
        shutil.rmtree(self.split_folder)
    except FileNotFoundError:
This is Python 3 only, unfortunately.
I've put in

if sys.version_info[0] == 2:
    # FIXME: I don't know if this is good practice / robust
    FileNotFoundError = OSError

on top of the module. This worked for me in a quick test, but the FIXME note should be taken literally.
Removed in b0bc90e, since we no longer empty the split_folder. I would still appreciate some feedback on this in order to avoid similar situations in the future.
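For future reference, one way to side-step the missing FileNotFoundError on Python 2 without redefining the builtin is to catch OSError and inspect errno; a small sketch (not the code used in this PR):

import errno
import shutil

def remove_tree_if_present(path):
    try:
        shutil.rmtree(path)
    except OSError as e:
        # Python 3 raises FileNotFoundError (a subclass of OSError) for a
        # missing path, Python 2 raises a plain OSError; ENOENT covers both.
        if e.errno != errno.ENOENT:
            raise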
Hm, you are right. I didn't think of the licensing issues. I've never worked with

On a side note: should we move this discussion to another issue, since it applies to all datasets rather than only ImageNet?
@pmeier yes, the testing is something that should be done for all datasets. I'll start working on it right now. I believe this is a pretty high-priority feature to have right now, as it is very difficult to review dataset code and ensure that it works as expected for both Py2 and Py3. I'll try to push some POC implementation for the tests today and I'll tag you there.
@fmassa Any progress?
@pmeier I didn't have the time yet to finish the tests for the datasets: this involved refactoring the downloading abstractions and I got stuck with other things. I'll add tests for this PR after merging it. |
Thanks!
I'll be adding tests for this class in a follow-up PR
This is my first attempt to implement the ImageNet dataset as discussed in #713. I only used the official files, which can be downloaded here. Since the training and validation set as well as the meta information have to be downloaded separately, I see no downside to structuring the dataset properly. In anticipation of multiple years of the ILSVRC I've created the following dataset structure:
The synset identifiers and the corresponding converter are accessible via the attributes wnids and wnid_to_idx, while classes and class_to_idx now refer to the human-readable classes.

Major Edits
year was removed as requested. Thus, the tree now looks like this:

To Do
- For now only the classification challenge is supported.
- For now only the ILSVRC2012 is supported. The parsing of the development kit, which contains the meta information, is (probably) not yet applicable to other years.
- I don't know if the meta information changes between the years. If not, I think it would be preferable to have only one meta file in the ILSVRC folder.
- The class_to_idx converter is not a good solution in its current state, since one requires the tuple of all correct class names to convert it to an index.
- For now this is only tested on the validation set, since the download of the training set is still ongoing.

Let me know what you think.
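To make the attribute naming above concrete, a hypothetical usage sketch; the constructor arguments mirror the other torchvision datasets, and the exact values are illustrative:

from torchvision import datasets

# Assumes the ILSVRC2012 files are already available under this root.
dataset = datasets.ImageNet("path/to/imagenet", split="val")

wnid = dataset.wnids[0]                 # synset identifier, e.g. "n01440764"
idx = dataset.wnid_to_idx[wnid]         # synset identifier -> class index
names = dataset.classes[idx]            # human-readable class name(s) for that index
same_idx = dataset.class_to_idx[names]  # currently needs the full tuple of names (see To Do)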