Skip to content

fix eurosat prototype mock data setup #5549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 14, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 6 additions & 6 deletions test/builtin_dataset_mocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -1329,20 +1329,20 @@ def cub200(info, root, config):

@register_mock
def eurosat(info, root, config):
data_folder = pathlib.Path(root, "eurosat", "2750")
data_folder = root / "2750"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the linux tests were passing before with the extra eurosat folder?
I would assume that if we change the structure in the , this would have to be reflected somehow in the dataset code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the prototype datasets never look at the absolute path, but only relative starting from the right

category = pathlib.Path(path).parent.name

adding more folders on the left side makes no difference.

data_folder.mkdir(parents=True)

num_examples_per_class = 3
classes = ("AnnualCrop", "Forest")
for cls in classes:
categories = ["AnnualCrop", "Forest"]
for category in categories:
create_image_folder(
root=data_folder,
name=cls,
file_name_fn=lambda idx: f"{cls}_{idx}.jpg",
name=category,
file_name_fn=lambda idx: f"{category}_{idx + 1}.jpg",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how the idx + 1 part relates to the fix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, but it aligns the mock data more with real data. I just found this while looking into the real data to debug this issue and didn't want to have an extra PR for it.

num_examples=num_examples_per_class,
)
make_zip(root, "EuroSAT.zip", data_folder)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using data_folder here, it is assumed that this folder and all of its contents should be included in the archive. By default, make_zip removes all files and folders it includes. Since the extra eurosat folder is not included, it is left as is. Thus, there are now an empty eurosat folder as well as the EuroSAT.zip file in the root directory.

Although not correct, that wouldn't be an issue if weren't for a quirk on Windows and macOS: pathlib treats file and folder names case-insensitive 🤯 This means,

stem = path.name.replace("".join(path.suffixes), "")
# In a first step, we check for a folder with the same stem as the raw file. If it exists, we use it since
# extracted files give the best I/O performance. Note that OnlineResource._extract() makes sure that an archive
# is always extracted in a folder with the corresponding file name.
folder_candidate = path.parent / stem
if folder_candidate.exists() and folder_candidate.is_dir():
return self._loader(folder_candidate)

is picking up the empty eurosat folder as a candidate for EuroSAT.zip although it should only look for an EuroSAT folder. Since we prefer extracted files over archives, it now loads the empty folder and thus all tests fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows and macOS: pathlib treats file and folder names case-insensitive

Is this a bug, or a documented "feature" ? Are we likely to hit more bugs in the future because of this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a bug nor documented, since this is part of system. Neither macOS nor Windows support case sensitive files.

IMO the only places where we might hit this is in the test suite if the mock data is setup wrong. A scenario like above cannot happen during normal operation unless a user manually manipulates the data folders.

return len(classes) * num_examples_per_class
return len(categories) * num_examples_per_class


@register_mock
Expand Down
2 changes: 2 additions & 0 deletions test/test_prototype_builtin_datasets.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,8 @@ def test_sample(self, test_home, dataset_mock, config):

try:
sample = next(iter(dataset))
except StopIteration:
raise AssertionError("Unable to draw any sample.") from None
Comment on lines +56 to +57
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how this relates to the fix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes this case more expressive. Before you got an error message like Drawing a sample resulted in the error above while "above" is a plain StopIteration. If you are familiar with datapipes, you should be able to interpret this, but new contributors might not be. I've fixed this here, because we were hitting this exact case.

except Exception as error:
raise AssertionError("Drawing a sample raised the error above.") from error

Expand Down