fix prototype transforms tests with set agg_method #6934

Conversation
self._make_error_meta(AssertionError, "aggregated mismatch")
agg_abs_diff = float(self.agg_method(abs_diff.to(torch.float64)))
if agg_abs_diff > self.atol:
    raise self._make_error_meta(
This was the source of the actual bug. Without the raise we just created an exception, but never did anything with it. Thus, all tests that set agg_method in their closeness_kwargs passed without a value check.
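To illustrate the failure mode in isolation, here is a minimal sketch with a hypothetical check_agg_abs_diff helper (not the actual pair implementation): constructing the exception object without raise is a no-op, so the buggy variant can never fail.

import torch

def check_agg_abs_diff(actual, expected, atol, agg_method):
    # hypothetical helper mirroring the aggregated check above
    abs_diff = (actual.to(torch.float64) - expected.to(torch.float64)).abs()
    agg_abs_diff = float(agg_method(abs_diff))

    # Buggy variant: the AssertionError is constructed but never raised,
    # so the check always "passes".
    if agg_abs_diff > atol:
        AssertionError(f"aggregated mismatch: {agg_abs_diff} > {atol}")

    # Fixed variant: actually raise the error.
    if agg_abs_diff > atol:
        raise AssertionError(f"aggregated mismatch: {agg_abs_diff} > {atol}")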
@@ -54,7 +46,7 @@
]

-class PILImagePair(TensorLikePair):
+class ImagePair(TensorLikePair):
I've refactored this to not handle mixed PIL / tensor image pairs. That was a problem when the tolerance was set for floating point images, i.e. in the range [0.0, 1.0], but the comparison converted to uint8, which needs higher tolerances.
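As a rough illustration (made-up numbers, not taken from the test suite) of why that mix is problematic: a tolerance chosen for float images in [0.0, 1.0] is far too tight once both sides are compared as uint8, where the same error is 255 times larger.

import torch

float_atol = 1e-2  # meaningful for images in [0.0, 1.0]
actual = torch.tensor([0.500])
expected = torch.tensor([0.505])  # within float_atol

torch.testing.assert_close(actual, expected, atol=float_atol, rtol=0)  # passes

# After converting both sides to uint8, the absolute difference is measured
# in [0, 255] units, so the same atol has to be scaled by 255 to still pass.
actual_u8 = (actual * 255).round().to(torch.uint8)      # 128
expected_u8 = (expected * 255).round().to(torch.uint8)  # 129
torch.testing.assert_close(actual_u8, expected_u8, atol=float_atol * 255, rtol=0)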
(("TestKernels", "test_against_reference"), torch.float32, "cpu"): dict(atol=0.9, rtol=0, agg_method="mean"), | ||
(("TestKernels", "test_against_reference"), torch.uint8, "cpu"): dict(atol=255 * 0.9, rtol=0, agg_method="mean"), |
These are crazy tolerances and render the tests effectively useless. This PR just lays the foundation to fix them in the near future. I'll open an issue ASAP with a game plan to review again all the operators that need these tolerances for some reason.
These are crazy tolerances indeed. From your earlier comment, I understand that the tests were effectively not throwing exceptions, so here you adjust the values to make them pass and then revisit all kernels that fail. Is my understanding correct?
Yes. See #6937.
# 2D mask shenanigans
if output_tensor.ndim == 2 and input_tensor.ndim == 3:
    output_tensor = output_tensor.unsqueeze(0)
elif output_tensor.ndim == 3 and input_tensor.ndim == 2:
    output_tensor = output_tensor.squeeze(0)
For all other shape mismatches we can let the comparison logic handle the error.
@@ -2070,6 +2099,17 @@ def sample_inputs_ten_crop_video():
    yield ArgsKwargs(video_loader, size=size)


def multi_crop_pil_reference_wrapper(pil_kernel):
Small helper, since the regular pil_reference_wrapper cannot handle conversion of these tuple or list outputs.
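A sketch of what such a wrapper could look like (an illustration, not the exact helper from the PR): the wrapped PIL kernel returns a tuple or list of PIL images (e.g. for five_crop / ten_crop), so each element has to be converted back to a tensor individually.

import PIL.Image
from torchvision.transforms.functional import pil_to_tensor


def multi_crop_pil_reference_wrapper(pil_kernel):
    def wrapper(image, *args, **kwargs):
        output = pil_kernel(image, *args, **kwargs)
        # convert every PIL crop in the tuple/list output back to a tensor,
        # which the plain per-image wrapper does not do for sequences
        return type(output)(
            pil_to_tensor(crop) if isinstance(crop, PIL.Image.Image) else crop
            for crop in output
        )

    return wrapper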
expected_image, expected_mask = t_ref(*dp_ref)
if isinstance(actual_image, torch.Tensor) and not isinstance(expected_image, torch.Tensor):
    expected_image = legacy_F.pil_to_tensor(expected_image)
    expected_mask = legacy_F.pil_to_tensor(expected_mask).squeeze(0)
expected = (expected_image, expected_mask)
This was relying on the mixed PIL / tensor image comparison that was removed above. Thus, we do the conversion manually here.
@@ -237,7 +237,6 @@ def test_against_reference(self, test_id, info, args_kwargs):
    assert_close(
        actual,
        expected,
-       check_dtype=False,
We can be strict here now, since we perform the conversion correctly.
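Assuming the manual pil_to_tensor conversion above, actual and expected now share a dtype, so the default dtype check of assert_close can stay enabled; a rough example of the behavior:

import torch
from torch.testing import assert_close

# same dtype on both sides -> passes with the default check_dtype=True
assert_close(
    torch.zeros(3, 4, 4, dtype=torch.uint8),
    torch.zeros(3, 4, 4, dtype=torch.uint8),
)
# a float32 vs. uint8 pair would now be reported as a dtype mismatch
# instead of being silently accepted via check_dtype=False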
Even with the crazy tolerances, there are still failures in CI:
I can't reproduce this locally, so it is probably flaky. Given that I will look into this ASAP, I wouldn't block over it. The only other safe option is to disable the reference tests for these kernels for now. Your choice.
@pmeier I think it's worth fixing the issues in this PR and merging later. What's the purpose of making a bunch of test corrections and then disabling the tests? No strong opinion though.
I don't know the extent of the changes needed yet. This PR might get out of hand. But I'm not the one that needs to review it, so I'll leave it up to you.
Offering a stamp to unblock the work. Up to you @pmeier whether you want to make the changes here so that all tests pass, or do it gradually in follow-ups. My only concern is that turning off all reference tests might be dangerous. Perhaps we could skip only those that fail?
With the latest commit I removed the DEFAULT_PIL_REFERENCE_CLOSENESS_KWARGS in favor of setting the tolerances for each kernel separately. Some of them are pretty wild and I'm going to highlight them below. Given that everything works in our E2E tests, IMO this is either caused by parameter combinations that we don't use or a problematic test setup.
We need to investigate them in any case. Let's use #6937 to coordinate. I'll update it soon to check everything that was already done in this PR.
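For reference, a per-kernel tolerance entry follows the (test id, dtype, device) keying scheme shown earlier; the values below are placeholders, not the ones used in the PR.

import torch

closeness_kwargs = {
    # hypothetical per-kernel tolerances keyed by (test id, dtype, device)
    (("TestKernels", "test_against_reference"), torch.float32, "cpu"): dict(atol=1e-5, rtol=0),
    (("TestKernels", "test_against_reference"), torch.uint8, "cpu"): dict(atol=1, rtol=0),
}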
Still LGTM.
A few comments:
- We do know that accuracy is not affected, but it might be the case that a kernel is broken/buggy on setups we haven't tested (floats, for instance) or that the specific augmentation doesn't have much effect on accuracy. So, as you highlighted, we must check things very thoroughly to understand whether this is a test problem or an actual bug.
- I wonder if it's worth focusing on single-pixel differences rather than averages. As averages can accumulate, it might be worth examining whether max single-pixel tolerances can be stricter (see the sketch after this comment).
- Some of the tolerances look scary. We should bisect and revert any speed optimizations that cause them. It might also be worth checking our tensor kernels against PIL. The stable tests do that now, and this is how we ensure that stable and PIL stay aligned. V2 doesn't have such tests (of course it compares itself with V1), but due to the increased tolerances we might be missing something important.
Let's merge on green CI and treat this as a top-priority task to resolve.
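A toy example (made-up tensors) of the difference between an aggregated mean tolerance and a per-pixel max tolerance: one completely wrong pixel barely moves the mean but is immediately caught by the max.

import torch

expected = torch.zeros(1, 256, 256)
actual = expected.clone()
actual[0, 0, 0] = 1.0  # a single pixel is off by the full value range

abs_diff = (actual - expected).abs()
print(abs_diff.mean().item())  # ~1.5e-5, passes a mean-based tolerance like 1e-3
print(abs_diff.max().item())   # 1.0, fails any reasonable per-pixel tolerance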
In the end I've refactored this quite a bit so we get a clearer picture. Here are the main changes this PR brings:
With these changes, there are only a couple of operators with wild tolerances left to check:
They are tracked in #6937, which will be used to coordinate after this PR. Since these tests are pretty flaky right now, I cranked up the tolerances to avoid problems with the CI. Let's see if I need more.
Summary:
- fix prototype transforms tests with set agg_method
- use individual tolerances
- refactor PIL reference test
- increase tolerance for elastic_mask
- fix autocontrast tolerances
- increase tolerance for RandomAutocontrast

Reviewed By: NicolasHug
Differential Revision: D41265197
fbshipit-source-id: 57eba523c4e6672d8a1d0cae8b7b95f1d52f13bf
assert_{close, equal} was buggy for prototype tests if one set the agg_method, as we do for a few tests through vision/test/prototype_transforms_kernel_infos.py (Lines 64 to 67 in f32600b).
I only wanted to fix that, but realized fixing this entails a lot more. A lot of the reference tests that used these tolerances failed. Thus, this PR only lays the foundation to fix them in the near future.
cc @vfdev-5 @datumbox @bjuncek