
fix prototype transforms tests with set agg_method #6934


Merged: pmeier merged 10 commits into pytorch:main from proto-test-tol on Nov 11, 2022

Conversation

@pmeier (Collaborator) commented Nov 9, 2022:

assert_{close, equal} was buggy for prototype tests if one set agg_method, as we do for a few tests through

DEFAULT_PIL_REFERENCE_CLOSENESS_KWARGS = {
(("TestKernels", "test_against_reference"), torch.float32, "cpu"): dict(atol=1e-5, rtol=0, agg_method="mean"),
(("TestKernels", "test_against_reference"), torch.uint8, "cpu"): dict(atol=1e-5, rtol=0, agg_method="mean"),
}
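
(For context, a minimal sketch of how such a table could be consulted, keyed by a (test id, dtype, device) triple; the helper below is an assumption modeled on the get_closeness_kwargs call visible in the traceback further down, not the actual KernelInfo method:)

def get_closeness_kwargs(closeness_kwargs, test_id, *, dtype, device):
    # Fall back to an empty dict, i.e. assert_close's default tolerances.
    return closeness_kwargs.get((test_id, dtype, device), dict())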

I only wanted to fix that, but realized fixing this entails a lot more. A lot of the reference tests that used these tolerances failed. Thus, this PR only lays the foundation to fix them in the near future.

cc @vfdev-5 @datumbox @bjuncek

 agg_abs_diff = float(self.agg_method(abs_diff.to(torch.float64)))
 if agg_abs_diff > self.atol:
-    self._make_error_meta(AssertionError, "aggregated mismatch")
+    raise self._make_error_meta(
pmeier (Collaborator, Author):

This was the source of the actual bug. Without the raise, we just created an exception but never did anything with it. Thus, all tests that set agg_method in their closeness_kwargs passed without a value check.
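
To illustrate the failure mode, a simplified sketch (not the actual torch.testing internals):

# _make_error_meta only constructs the error object; the caller has to raise
# (or collect) it. Calling it bare discards the result.
class AggregatedComparison:
    atol = 1e-5

    def _make_error_meta(self, exc_type, msg):
        return exc_type(msg)

    def check(self, agg_abs_diff):
        if agg_abs_diff > self.atol:
            # BUG (before): the error was created but never raised, so the
            # check silently passed:
            #   self._make_error_meta(AssertionError, "aggregated mismatch")
            # FIX (after): actually raise it:
            raise self._make_error_meta(AssertionError, "aggregated mismatch")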

@@ -54,7 +46,7 @@
]


-class PILImagePair(TensorLikePair):
+class ImagePair(TensorLikePair):
pmeier (Collaborator, Author):

I've refactored this to no longer handle mixed PIL / tensor image pairs. It was a problem if the tolerance was set for floating point images, i.e. in the range [0.0, 1.0], but the comparison converted them to uint8, which needs higher tolerances.
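
A small numeric illustration of the problem (the values and the round-to-nearest conversion are assumptions for the example, not taken from the PR):

import torch

# Two float images that are equal within atol=1e-5 ...
a = torch.tensor([0.4999999])
b = torch.tensor([0.5000001])
assert torch.allclose(a, b, atol=1e-5, rtol=0)

# ... can land one quantization step apart after conversion to uint8, and a
# single uint8 step (1/255 of the float range) blows far past that atol.
a8 = (a * 255).round().to(torch.uint8)  # tensor([127])
b8 = (b * 255).round().to(torch.uint8)  # tensor([128])
assert not torch.allclose(a8.float(), b8.float(), atol=1e-5, rtol=0)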

Comment on lines 66 to 67
(("TestKernels", "test_against_reference"), torch.float32, "cpu"): dict(atol=0.9, rtol=0, agg_method="mean"),
(("TestKernels", "test_against_reference"), torch.uint8, "cpu"): dict(atol=255 * 0.9, rtol=0, agg_method="mean"),
pmeier (Collaborator, Author):

These are crazy tolerances and render the test effectively useless. This PR just lays the foundation to fix these in the near future. I'll open an issue ASAP with a game plan to review again all the operators that need these tolerances for some reason.
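
To see just how loose atol=0.9 with agg_method="mean" is for float images in [0.0, 1.0], a quick sketch (assuming agg_method aggregates the absolute difference before it is compared against atol):

import torch

actual = torch.zeros(3, 8, 8)           # all-black image
expected = torch.full((3, 8, 8), 0.85)  # nearly white image

mean_abs_diff = (actual - expected).abs().mean().item()  # 0.85
# 0.85 <= 0.9, so two wildly different images still compare as "close".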

Contributor:

These are crazy tolerances indeed. From your earlier comment, I understand that the tests were effectively not throwing exceptions, so here you adjust the values to make them pass and then revisit all kernels that fail. Is my understanding correct?

@pmeier (Collaborator, Author) commented Nov 9, 2022:

Yes. See #6937.

Comment on lines +91 to +95
# 2D mask shenanigans
if output_tensor.ndim == 2 and input_tensor.ndim == 3:
output_tensor = output_tensor.unsqueeze(0)
elif output_tensor.ndim == 3 and input_tensor.ndim == 2:
output_tensor = output_tensor.squeeze(0)
pmeier (Collaborator, Author):

For all other shape mismatches we can let the comparison logic handle the error.

@@ -2070,6 +2099,17 @@ def sample_inputs_ten_crop_video():
yield ArgsKwargs(video_loader, size=size)


def multi_crop_pil_reference_wrapper(pil_kernel):
pmeier (Collaborator, Author):

A small helper, since the regular pil_reference_wrapper cannot handle the conversion of these tuple or list outputs.
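
A hedged sketch of what such a wrapper could look like (the PR's actual implementation may differ; five_crop / ten_crop style kernels return a tuple or list of PIL images rather than a single image):

import torchvision.transforms.functional as legacy_F

def multi_crop_pil_reference_wrapper(pil_kernel):
    def wrapper(input_tensor, *args, **kwargs):
        pil_input = legacy_F.to_pil_image(input_tensor)
        outputs = pil_kernel(pil_input, *args, **kwargs)
        # Convert each PIL image in the tuple/list output individually.
        return type(outputs)(
            legacy_F.pil_to_tensor(output) for output in outputs
        )
    return wrapper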

Comment on lines +1025 to +1029
expected_image, expected_mask = t_ref(*dp_ref)
if isinstance(actual_image, torch.Tensor) and not isinstance(expected_image, torch.Tensor):
expected_image = legacy_F.pil_to_tensor(expected_image)
expected_mask = legacy_F.pil_to_tensor(expected_mask).squeeze(0)
expected = (expected_image, expected_mask)
pmeier (Collaborator, Author):

This was relying on the mixed PIL / tensor image comparison that was removed above. Thus, we do it manually here.

@@ -237,7 +237,6 @@ def test_against_reference(self, test_id, info, args_kwargs):
 assert_close(
     actual,
     expected,
-    check_dtype=False,
pmeier (Collaborator, Author):

We can be strict here now, since we perform the conversion correctly.
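
For instance (a small sketch using torch.testing.assert_close directly; the exact message wording may differ across torch versions), the default check_dtype=True now catches an accidental uint8 / float32 mixup on its own:

import torch

torch.testing.assert_close(
    torch.tensor([1.0]),
    torch.tensor([1], dtype=torch.uint8),
)  # raises AssertionError because the dtypes do not match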

@pmeier pmeier marked this pull request as ready for review November 9, 2022 15:49
@pmeier pmeier requested review from datumbox and vfdev-5 November 9, 2022 15:49
@pmeier (Collaborator, Author) commented Nov 9, 2022:

Even with the crazy tolerances, there are still failures in CI:

_____ TestKernels.test_against_reference[adjust_contrast_image_tensor-30] ______
Traceback (most recent call last):
  File "/home/runner/work/vision/vision/test/test_prototype_transforms_functional.py", line 240, in test_against_reference
    **info.get_closeness_kwargs(test_id, dtype=input.dtype, device=input.device),
  File "/home/runner/work/vision/vision/test/prototype_common_utils.py", line 132, in assert_close
    **kwargs,
  File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/torch/testing/_comparison.py", line 1118, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: The 'mean' of the absolute difference is 231.9391304347826, but only 229.5 is allowed.

I can't reproduce this locally, so it is probably flaky. Note that the 229.5 allowed here is exactly the atol=255 * 0.9 from above. Given that I will look into this ASAP, I wouldn't block over it. The only other safe option is to disable the reference tests for these kernels for now. Your choice.

@datumbox (Contributor) commented Nov 9, 2022:

@pmeier I think it's worth fixing the issues in this PR and merging later. What's the purpose of making a bunch of test corrections and then disabling the tests? No strong opinion though.

@pmeier (Collaborator, Author) commented Nov 9, 2022:

What's the purpose of making a bunch of test corrections and then disabling the tests?

I don't know the extent of the changes needed yet. This PR might get out of hand. But I'm not the one that needs to review it, so I'll leave it up to you.

@datumbox (Contributor) left a comment:

Offering a stamp to unblock the work. Up to you @pmeier if you want to make the changes here to make all tests pass, or do it gradually in follow-ups. My only concern is that turning off all reference tests might be dangerous. Perhaps we could skip only those that fail?

@pmeier (Collaborator, Author) left a comment:

With the latest commit I removed the DEFAULT_PIL_REFERENCE_CLOSENESS_KWARGS in favor of setting the tolerances for each kernel separately. Some of them are pretty wild, and I'm going to highlight them below. Given that everything works in our E2E tests, IMO this is caused either by parameter combinations that we don't use or by a problematic test setup.

We need to investigate them in any case. Let's use #6937 to coordinate. I'll update it soon to check everything that was already done in this PR.

@datumbox (Contributor) left a comment:

Still LGTM.

A few comments:

  • We do know that accuracy is not affected, but it might be the case that a kernel is broken/buggy on setups we haven't tested (floats, for instance) or that the specific augmentation doesn't have much effect on accuracy. So, as you highlighted, we must check things very thoroughly to understand whether this is a test problem or an actual bug.
  • I wonder if it's worth focusing on single-pixel differences rather than averages. As averages can accumulate, it might be worth examining if max single-pixel tolerances can be stricter (see the sketch after this list).
  • Some of the tolerances look scary. We should bisect and revert any speed optimizations that cause them. It might also be worth checking our tensor kernels against PIL. The stable tests do that now, and this is how we ensure that stable and PIL stay aligned. V2 doesn't have such tests (of course it compares itself with V1), but due to the increased tolerances we might be missing something important.
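
(A hedged sketch of the max-based criterion suggested above; mean aggregation can stay small even when a few pixels are far off, while the max catches localized errors:)

import torch

def max_abs_diff(actual: torch.Tensor, expected: torch.Tensor) -> float:
    # Strictest single-pixel criterion: the largest absolute deviation.
    return float((actual.double() - expected.double()).abs().max())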

Let's merge on green CI and take this as a top-priority task to resolve.

@pmeier pmeier marked this pull request as draft November 10, 2022 12:52
@pmeier (Collaborator, Author) commented Nov 10, 2022:

In the end I've refactored this quite a bit so we get a clearer picture. Here are the main changes this PR brings:

  • Remove the ability to check PIL vs. tensor images in assert_close. The automatic type and dtype conversion made it really hard to correctly specify tolerances.
  • Remove the ability to perform reference tests against PIL with floating point tensors. Again, conversion automagic made it really hard to locate issues.
  • Add a new test that checks image kernels for consistency between uint8 and float32. With this and the reference test against PIL for uint8, we have the same information as before, but are able to pinpoint problems more easily (a sketch of such a check follows below).
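
A minimal sketch of what such a consistency check could look like (a hypothetical helper, not the PR's exact test; the tolerance of one uint8 quantization step is an assumption):

import torch

def check_uint8_float32_consistency(kernel, image_uint8, **kernel_kwargs):
    # Run the kernel on the float32 version of the image ...
    actual = kernel(image_uint8.to(torch.float32) / 255.0, **kernel_kwargs)
    # ... and compare against the uint8 result mapped back to [0.0, 1.0].
    expected = kernel(image_uint8, **kernel_kwargs).to(torch.float32) / 255.0
    torch.testing.assert_close(actual, expected, atol=1 / 255, rtol=0)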

With these changes, there are only a couple of operators with wild tolerances left to check:

  • adjust_hue_image_tensor
  • elastic_image_tensor
  • elastic_mask
  • perspective_image_tensor
  • resize_image_tensor
  • resized_crop_image_tensor
  • rotate_image_tensor

They are tracked in #6937, which we will use to coordinate after this PR.

Since these tests are pretty flaky right now, I cranked up the tolerances to avoid problems with the CI. Let's see if I need more.

@pmeier pmeier marked this pull request as ready for review November 11, 2022 07:02
@pmeier pmeier merged commit 65769ab into pytorch:main Nov 11, 2022
@pmeier pmeier deleted the proto-test-tol branch November 11, 2022 08:12
facebook-github-bot pushed a commit that referenced this pull request Nov 14, 2022
Summary:
* fix prototype transforms tests with set agg_method

* use individual tolerances

* refactor PIL reference test

* increase tolerance for elastic_mask

* fix autocontrast tolerances

* increase tolerance for RandomAutocontrast

Reviewed By: NicolasHug

Differential Revision: D41265197

fbshipit-source-id: 57eba523c4e6672d8a1d0cae8b7b95f1d52f13bf