Skip to content

fix eurosat prototype mock data setup #5549

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 14, 2022

Conversation

pmeier
Copy link
Collaborator

@pmeier pmeier commented Mar 4, 2022

torchdata reported CI failures for Windows and macOS for the EuroSAT prototype from #5452 cc @NivekT @Dbhasin1. The mock data is setup slightly wrong, but that shouldn't matter were it not for a quirk on Windows and macOS. Details follow inline.

We didn't detect that, because our prototype tests only run on Linux. I've switched to Windows for debugging. Passing Windows CI run with the patch from this PR: https://app.circleci.com/pipelines/github/pytorch/vision/15300/workflows/c9eb674a-a3fa-4bcc-9824-c047beb47579/jobs/1235498

@facebook-github-bot
Copy link

facebook-github-bot commented Mar 4, 2022

💊 CI failures summary and remediations

As of commit 3062c71 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@pmeier pmeier force-pushed the eurosat-windows-macos branch from fa803fb to ed0e2a6 Compare March 4, 2022 07:30
@pmeier pmeier force-pushed the eurosat-windows-macos branch from ed0e2a6 to a5b4f2d Compare March 4, 2022 07:33
@@ -1329,20 +1329,20 @@ def cub200(info, root, config):

@register_mock
def eurosat(info, root, config):
data_folder = pathlib.Path(root, "eurosat", "2750")
data_folder = pathlib.Path(root, "2750")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This extra eurosat folder is a remnant from the legacy mock data and is not present in the actual archive.

name=cls,
file_name_fn=lambda idx: f"{cls}_{idx}.jpg",
name=category,
file_name_fn=lambda idx: f"{category}_{idx + 1}.jpg",
num_examples=num_examples_per_class,
)
make_zip(root, "EuroSAT.zip", data_folder)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using data_folder here, it is assumed that this folder and all of its contents should be included in the archive. By default, make_zip removes all files and folders it includes. Since the extra eurosat folder is not included, it is left as is. Thus, there are now an empty eurosat folder as well as the EuroSAT.zip file in the root directory.

Although not correct, that wouldn't be an issue if weren't for a quirk on Windows and macOS: pathlib treats file and folder names case-insensitive 🤯 This means,

stem = path.name.replace("".join(path.suffixes), "")
# In a first step, we check for a folder with the same stem as the raw file. If it exists, we use it since
# extracted files give the best I/O performance. Note that OnlineResource._extract() makes sure that an archive
# is always extracted in a folder with the corresponding file name.
folder_candidate = path.parent / stem
if folder_candidate.exists() and folder_candidate.is_dir():
return self._loader(folder_candidate)

is picking up the empty eurosat folder as a candidate for EuroSAT.zip although it should only look for an EuroSAT folder. Since we prefer extracted files over archives, it now loads the empty folder and thus all tests fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows and macOS: pathlib treats file and folder names case-insensitive

Is this a bug, or a documented "feature" ? Are we likely to hit more bugs in the future because of this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a bug nor documented, since this is part of system. Neither macOS nor Windows support case sensitive files.

IMO the only places where we might hit this is in the test suite if the mock data is setup wrong. A scenario like above cannot happen during normal operation unless a user manually manipulates the data folders.

@pmeier pmeier changed the title [DEBUG] eurosat prototype on windows and macos fix eurosat prototype mock data setup Mar 4, 2022
@pmeier pmeier requested a review from NicolasHug March 4, 2022 09:05
@pmeier pmeier marked this pull request as ready for review March 4, 2022 09:05
Comment on lines +55 to +56
except StopIteration:
raise AssertionError("Unable to draw any sample.") from None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how this relates to the fix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes this case more expressive. Before you got an error message like Drawing a sample resulted in the error above while "above" is a plain StopIteration. If you are familiar with datapipes, you should be able to interpret this, but new contributors might not be. I've fixed this here, because we were hitting this exact case.

name=cls,
file_name_fn=lambda idx: f"{cls}_{idx}.jpg",
name=category,
file_name_fn=lambda idx: f"{category}_{idx + 1}.jpg",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how the idx + 1 part relates to the fix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't, but it aligns the mock data more with real data. I just found this while looking into the real data to debug this issue and didn't want to have an extra PR for it.

name=cls,
file_name_fn=lambda idx: f"{cls}_{idx}.jpg",
name=category,
file_name_fn=lambda idx: f"{category}_{idx + 1}.jpg",
num_examples=num_examples_per_class,
)
make_zip(root, "EuroSAT.zip", data_folder)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Windows and macOS: pathlib treats file and folder names case-insensitive

Is this a bug, or a documented "feature" ? Are we likely to hit more bugs in the future because of this?

@@ -1329,20 +1329,20 @@ def cub200(info, root, config):

@register_mock
def eurosat(info, root, config):
data_folder = pathlib.Path(root, "eurosat", "2750")
data_folder = root / "2750"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the linux tests were passing before with the extra eurosat folder?
I would assume that if we change the structure in the , this would have to be reflected somehow in the dataset code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the prototype datasets never look at the absolute path, but only relative starting from the right

category = pathlib.Path(path).parent.name

adding more folders on the left side makes no difference.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pmeier

@pmeier pmeier merged commit 7bb8186 into pytorch:main Mar 14, 2022
@pmeier pmeier deleted the eurosat-windows-macos branch March 14, 2022 10:38
facebook-github-bot pushed a commit that referenced this pull request Mar 15, 2022
Summary:
* [DEBUG] eurosat prototype on windows and macos

* print paths

* fix eurosat mock data setup

* revert changes

* minor cleanup

Reviewed By: vmoens

Differential Revision: D34878969

fbshipit-source-id: 6784c6119b65ec719683195e7d1ebbf22a9d1b2d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants