
Make transforms.functional_tensor functions differential w.r.t. their parameters #4995


Closed · 65 commits

Conversation


ain-soph commented Nov 26, 2021

Linked Issue: #5000

Make operations in torchvision.transforms.functional_tensor differentiable w.r.t. their hyper-parameters, which is helpful for Faster AutoAugment search (the hyper-parameters become learnable parameters that receive gradients via backward).

Some operations are inherently non-differentiable (e.g., Posterize), so users might need to write their own implementations for those.
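For context, here is a minimal sketch of the intended usage. It assumes, as this PR proposes, that adjust_brightness accepts a tensor magnitude and keeps it in the autograd graph (the current release annotates the factor as float):

import torch
import torchvision.transforms.functional as F

# the augmentation hyper-parameter we want to learn
magnitude = torch.tensor(1.5, requires_grad=True)
img = torch.rand(3, 32, 32)

out = F.adjust_brightness(img, magnitude)  # needs this PR's changes
out.mean().backward()
print(magnitude.grad)  # populated once the op is differentiable w.r.t. magnitude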

Todo List:

  • JIT support
  • Affine has no backward gradient w.r.t. matrix.
  • Brightness, Contrast, Hue, Saturation, Gamma, Sharpness, Solarize
  • Affine, Rotate (require some modification in torchvision.transforms.functional)
  • Corresponding annotations of methods in torchvision.transforms.functional
  • autocontrast seems to be incorrect and needs a further check (solved in #4999: Fix bug on autocontrast when min==max).

bound = 1.0 if img.is_floating_point() else 255.0
dtype = img.dtype if torch.is_floating_point(img) else torch.float32
minimum = img.amin(dim=(-2, -1), keepdim=True).to(dtype)
maximum = img.amax(dim=(-2, -1), keepdim=True).to(dtype)
# L940: indexes dim 0, which is N for batched input but C for a single image
eq_idxs = torch.where(minimum == maximum)[0]
minimum[eq_idxs] = 0
maximum[eq_idxs] = bound
scale = bound / (maximum - minimum)
return ((img - minimum) * scale).clamp(0, bound).to(img.dtype)

The input image could be either (N, C, H, W) or (C, H, W), which makes L940 index along different dimensions (N in one case, C in the other). This is inconsistent behavior between the two cases.

Other implementations for reference:
https://github.com/moskomule/dda/blob/master/dda/functional.py#L203
https://pillow.readthedocs.io/en/stable/_modules/PIL/ImageOps.html#autocontrast


If any maintainer thinks this is worth doing, I'll continue to work on it.

Make operations differentiable w.r.t. hyper-parameters, which is extremely helpful for AutoAugment search.

facebook-github-bot commented Nov 26, 2021

💊 CI failures summary and remediations

As of commit 9ce5d09 (more details on the Dr. CI page):


  • 6/7 failures introduced in this PR
  • 1/7 broken upstream at merge base 8aa3174 on Dec 09 from 10:35am to 11:09am

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build unittest_linux_cpu_py3.7 (1/4)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

/root/project/torchvision/io/video.py:399: Runt...log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found

test/test_internet.py::TestDatasetUtils::test_download_url_dispatch_download_from_google_drive
  /root/project/env/lib/python3.7/unittest/mock.py:1966: ResourceWarning: unclosed <ssl.SSLSocket [closed] fd=13, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>
    self.name = name

test/test_internet.py::TestDatasetUtils::test_download_url_dispatch_download_from_google_drive
  /root/project/env/lib/python3.7/unittest/mock.py:1966: ResourceWarning: unclosed <ssl.SSLSocket [closed] fd=14, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>
    self.name = name

test/test_io.py::TestVideo::test_read_video_timestamps_corrupted_file
  /root/project/torchvision/io/video.py:399: RuntimeWarning: Failed to open container for /tmp/tmpeutnipai.mp4; Caught error: [Errno 1094995529] Invalid data found when processing input: '/tmp/tmpeutnipai.mp4'; last error log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found
    warnings.warn(msg, RuntimeWarning)

test/test_models.py::test_memory_efficient_densenet[densenet121]
test/test_models.py::test_memory_efficient_densenet[densenet169]
test/test_models.py::test_memory_efficient_densenet[densenet201]
test/test_models.py::test_memory_efficient_densenet[densenet161]
  /root/project/env/lib/python3.7/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
    warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

test/test_models.py::test_inception_v3_eval

See CircleCI build unittest_linux_gpu_py3.8 (2/4)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

raise error_metas[0].to_error()

_ test_random_rotate[1-InterpolationMode.BILINEAR-degrees3-False-center3-cuda] _

Traceback (most recent call last):
  File "/home/circleci/project/test/test_transforms_tensor.py", line 556, in test_random_rotate
    _test_transform_vs_scripted_on_batch(transform, s_transform, batch_tensors)
  File "/home/circleci/project/test/test_transforms_tensor.py", line 44, in _test_transform_vs_scripted_on_batch
    assert_equal(transformed_batch, s_transformed_batch, msg=msg)
  File "/home/circleci/project/env/lib/python3.8/site-packages/torch/testing/_comparison.py", line 1217, in assert_close
    assert_equal(
  File "/home/circleci/project/env/lib/python3.8/site-packages/torch/testing/_comparison.py", line 997, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 29568 (0.0%)
Greatest absolute difference: 1 at index (0, 2, 0, 42) (up to 1e-06 allowed)
Greatest relative difference: 0.009900989942252636 at index (0, 2, 0, 42) (up to 0 allowed)

=============================== warnings summary ===============================

test/test_backbone_utils.py:34
  /home/circleci/project/test/test_backbone_utils.py:34: PytestCollectionWarning: cannot collect test class 'TestSubModule' because it has a __init__ constructor (from: test/test_backbone_utils.py)
    class TestSubModule(torch.nn.Module):

See CircleCI build unittest_linux_cpu_py3.6 (3/4)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

/root/project/torchvision/io/video.py:399: Runt...log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found

test/test_internet.py::TestDatasetUtils::test_download_url_dispatch_download_from_google_drive
  /root/project/env/lib/python3.6/unittest/mock.py:1857: ResourceWarning: unclosed <ssl.SSLSocket [closed] fd=13, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>
    setattr(_type, entry, MagicProxy(entry, self))

test/test_internet.py::TestDatasetUtils::test_download_url_dispatch_download_from_google_drive
  /root/project/env/lib/python3.6/unittest/mock.py:1857: ResourceWarning: unclosed <ssl.SSLSocket [closed] fd=14, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>
    setattr(_type, entry, MagicProxy(entry, self))

test/test_io.py::TestVideo::test_read_video_timestamps_corrupted_file
  /root/project/torchvision/io/video.py:399: RuntimeWarning: Failed to open container for /tmp/tmp_5hxcp6j.mp4; Caught error: [Errno 1094995529] Invalid data found when processing input: '/tmp/tmp_5hxcp6j.mp4'; last error log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found
    warnings.warn(msg, RuntimeWarning)

test/test_models.py::test_memory_efficient_densenet[densenet121]
test/test_models.py::test_memory_efficient_densenet[densenet169]
test/test_models.py::test_memory_efficient_densenet[densenet201]
test/test_models.py::test_memory_efficient_densenet[densenet161]
  /root/project/env/lib/python3.6/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
    warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

test/test_models.py::test_inception_v3_eval

See CircleCI build unittest_linux_cpu_py3.9 (4/4)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

/root/project/torchvision/io/video.py:399: Runt...log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found

test/test_internet.py::TestDatasetUtils::test_download_url_dispatch_download_from_google_drive
  /root/project/env/lib/python3.9/unittest/mock.py:2059: ResourceWarning: unclosed <ssl.SSLSocket [closed] fd=12, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>
    setattr(_type, entry, MagicProxy(entry, self))

test/test_internet.py::TestDatasetUtils::test_download_url_dispatch_download_from_google_drive
  /root/project/env/lib/python3.9/unittest/mock.py:2059: ResourceWarning: unclosed <ssl.SSLSocket [closed] fd=13, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6>
    setattr(_type, entry, MagicProxy(entry, self))

test/test_io.py::TestVideo::test_read_video_timestamps_corrupted_file
  /root/project/torchvision/io/video.py:399: RuntimeWarning: Failed to open container for /tmp/tmp8aixo5ei.mp4; Caught error: [Errno 1094995529] Invalid data found when processing input: '/tmp/tmp8aixo5ei.mp4'; last error log: [mov,mp4,m4a,3gp,3g2,mj2] moov atom not found
    warnings.warn(msg, RuntimeWarning)

test/test_models.py::test_memory_efficient_densenet[densenet121]
test/test_models.py::test_memory_efficient_densenet[densenet169]
test/test_models.py::test_memory_efficient_densenet[densenet201]
test/test_models.py::test_memory_efficient_densenet[densenet161]
  /root/project/env/lib/python3.9/site-packages/torch/autocast_mode.py:162: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
    warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

test/test_models.py::test_inception_v3_eval

2 failures not recognized by patterns:

Job | Step | Action
CircleCI unittest_windows_gpu_py3.8 | Run tests | 🔁 rerun
CircleCI cmake_linux_gpu | Build torchvision C++ distribution and test | 🔁 rerun

1 job timed out:

  • cmake_linux_gpu

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.



datumbox commented Nov 26, 2021

@ain-soph Thanks for opening this.

What you propose is interesting, but it might be worth discussing the API prior to writing the PR. This is to avoid a situation where you spend lots of time working on the feature and then can't merge it due to some unforeseen limitation (JIT might give us headaches, for example). Since you already opened a PR, we can discuss it here as well if you prefer.

@vfdev-5 Please share any thoughts on the endeavour.

Concerning the autocontrast behaviour, I would like to understand more why you say the behaviour is inconsistent.

The input Image could be either (N, C, H, W) or (C, H, W), which makes L940 to rely on different dimensions. This is inconsistent behavior for these 2 cases.

My understanding is that the lines in question always look for the min/max across H and W, no matter the input. The next 2 lines ensure that we won't divide by 0. Could you provide an example that showcases the issue?


ain-soph commented Nov 26, 2021

@datumbox

import torch
import torchvision.transforms.functional as F


a = torch.rand(1, 3, 32, 32)
a = a / 2 + 0.3

(F.autocontrast(a)[0] == F.autocontrast(a[0])).all()   # True  (because eq_idxs is empty)

a[0, 2] = 0.7   # set one channel to a constant

(F.autocontrast(a)[0] == F.autocontrast(a[0])).all()   # False

We should expect it to be True, because (1, C, H, W) should behave the same as (C, H, W). But the [0] in L940 makes the result depend on the first dimension of the image, which might be N or C.

A potential fix might be to unsqueeze (C, H, W) to (1, C, H, W), but we need to decide which behavior we expect (see the sketch after this list):

When the Red channel is constant while the Green and Blue channels are not:

  1. (image-wise) we don't normalize the image at all.
  2. (channel-wise) we don't normalize the Red channel, but we do normalize the Green and Blue channels.
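For illustration, here is a minimal sketch of option 2 (channel-wise). It always works on a (1, C, H, W) view and skips normalization for constant channels via torch.where instead of index assignment, so the batched and unbatched cases agree (a sketch only, independent of the actual fix in #4999):

import torch

def autocontrast_channelwise(img: torch.Tensor) -> torch.Tensor:
    needs_squeeze = img.ndim == 3
    if needs_squeeze:
        img = img.unsqueeze(0)  # (C, H, W) -> (1, C, H, W)
    bound = 1.0 if img.is_floating_point() else 255.0
    dtype = img.dtype if torch.is_floating_point(img) else torch.float32
    minimum = img.amin(dim=(-2, -1), keepdim=True).to(dtype)
    maximum = img.amax(dim=(-2, -1), keepdim=True).to(dtype)
    eq = minimum == maximum  # constant channels, per image and per channel
    # min -> 0 and max -> bound yields scale 1 and shift 0, i.e. identity
    minimum = torch.where(eq, torch.zeros_like(minimum), minimum)
    maximum = torch.where(eq, torch.full_like(maximum, bound), maximum)
    scale = bound / (maximum - minimum)
    out = ((img - minimum) * scale).clamp(0, bound).to(img.dtype)
    return out.squeeze(0) if needs_squeeze else out

With this, F.autocontrast(a)[0] == F.autocontrast(a[0]) would hold, because min/max are always taken per channel regardless of the batch dimension.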

@datumbox

@ain-soph Thanks a lot for reporting and for submitting a snippet that reproduces the issue. I submitted an alternative solution for this at #4999. Let me know your thoughts.


ain-soph commented Nov 28, 2021

@datumbox

Almost done (except JIT, which I'm totally unfamiliar with). But there is one issue with Affine:
the current implementation seems to have no gradient w.r.t. matrix (it is all zeros).

def affine(
    img: Tensor, matrix: List[float], interpolation: str = "nearest", fill: Optional[List[float]] = None
) -> Tensor:
    _assert_grid_transform_inputs(img, matrix, interpolation, fill, ["nearest", "bilinear"])
    dtype = img.dtype if torch.is_floating_point(img) else torch.float32
    theta = torch.tensor(matrix, dtype=dtype, device=img.device).reshape(1, 2, 3)
    shape = img.shape
    # grid will be generated on the same device as theta and img
    grid = _gen_affine_grid(theta, w=shape[-1], h=shape[-2], ow=shape[-1], oh=shape[-2])
    return _apply_grid_transform(img, grid, interpolation, fill=fill)

I have debugged this and found that grid.sum().backward() does produce a gradient w.r.t. matrix, but L704 loses the gradient information, which means _apply_grid_transform(img, grid, interpolation, fill=fill) has no gradient w.r.t. grid.

However, it relies on torch.nn.functional.grid_sample, which is a C++ implementation. Any further thoughts on how to make it differentiable?

I tried kornia.geometry.transform.affine (link) and it works well (note: you need to use their function kornia.geometry.transform.imgwarp.get_affine_matrix2d to generate the matrix, rather than the matrix generated by torchvision).
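For reference, roughly how that looks with kornia (a sketch from memory; the tensor shapes follow kornia 0.6-style signatures and may differ across versions):

import torch
import kornia

img = torch.rand(1, 3, 32, 32)
angle = torch.tensor([30.0], requires_grad=True)  # learnable magnitude
center = torch.tensor([[16.0, 16.0]])
scale = torch.ones(1, 2)
translations = torch.zeros(1, 2)

matrix = kornia.geometry.transform.get_affine_matrix2d(translations, center, scale, angle)
out = kornia.geometry.transform.affine(img, matrix[:, :2, :])  # affine expects (B, 2, 3)
out.mean().backward()
print(angle.grad)  # non-None: the warp is differentiable w.r.t. angle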


vfdev-5 commented Nov 28, 2021

@ain-soph I agree with @datumbox that prior to the PR it would have been better to open a feature-request issue and give the motivation for the requested feature.

According to the code in the PR, your idea is to propagate gradients through the transformations and make sure that arguments can be nn.Parameters or tensors with grads. If we can do that without any BC-breaking changes, it would be nice; otherwise we have to argue why keeping gradients and making the args learnable is more important than the previously working API.

As for the grid_sample op, grad propagation for grid is handled by PyTorch. For example:

import torch
from torch.nn.functional import grid_sample

img = torch.rand(2, 3, 4, 4, requires_grad=True)
grid = torch.rand(2, 6, 6, 2, requires_grad=True)

out = grid_sample(img, grid, align_corners=False)
assert out.requires_grad
out.mean().backward()
img.grad is not None, grid.grad is not None
> (True, True)


ain-soph commented Nov 28, 2021

@vfdev-5 Thanks for your response!

Backward compatibility is guaranteed, for sure. My only worry is that the code might be a little diffuse (e.g., the newly added _get_inverse_affine_matrix_tensor and the data-format conversions). I would be happy if anyone could simplify my code.

You are right, grid_sample has a backward gradient. I've tested more cases and found that the reason it doesn't work here is mode="nearest"; if I change it to bilinear, everything works (see the quick check below).
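As a quick check (an illustrative snippet, not code from the PR), the grid gradient is non-zero for bilinear but comes back all-zero for nearest, since nearest sampling is piecewise constant in the grid coordinates:

import torch
from torch.nn.functional import grid_sample

img = torch.rand(1, 3, 8, 8)
grid = torch.rand(1, 8, 8, 2, requires_grad=True)

out = grid_sample(img, grid, mode="bilinear", align_corners=False)
out.sum().backward()
print(grid.grad.abs().sum())  # > 0 in general

grid.grad = None
out = grid_sample(img, grid, mode="nearest", align_corners=False)
out.sum().backward()
print(grid.grad.abs().sum())  # tensor(0.): all-zero grid gradient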
I'm not sure whether an all-zero gradient is the expected behavior when mode="nearest". It seems correct to me that nearest sampling makes the gradient independent of grid. Maybe we should change the default interpolation mode to bilinear?

I'll later open a feature-request issue linked to this PR.
The main motivation is research. Faster AutoAugment proposes to search for augmentation policies using a DARTS-like framework, where all magnitudes and weights are trainable parameters. This requires all operations to have gradients w.r.t. their magnitudes, and it provides a faster search strategy than state-of-the-art AutoAugment policy search algorithms.
This work is maintained in autoalbument and, according to their documentation, has been applied in some industrial scenarios.

I'm currently doing some extended research based on Faster AutoAugment, and I think adding the backward feature w.r.t. magnitudes would be convenient and would support future research as well.

@ain-soph changed the title from Make operations differential to Make transforms.functional_tensor functions differential w.r.t. their parameters on Nov 28, 2021

ain-soph commented Nov 28, 2021

@datumbox

  1. The rotate function seems unnecessary, since it's already covered by affine. It might be better to deprecate rotate to make the API more unified, but it's totally fine to keep the current structure.

  2. It would be helpful if the arguments of affine had default values (e.g., angle=0.0, translate=[0.0, 0.0], shear=[0.0, 0.0]), so that we only need to call affine(tensor, shear=shear) without setting the other arguments (see the sketch below).
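For illustration, here is a thin wrapper sketching the suggested defaults (affine_with_defaults is a hypothetical helper; the argument names mirror torchvision.transforms.functional.affine):

from typing import List, Optional

import torch
from torch import Tensor
import torchvision.transforms.functional as F

def affine_with_defaults(
    img: Tensor,
    angle: float = 0.0,
    translate: Optional[List[int]] = None,
    scale: float = 1.0,
    shear: Optional[List[float]] = None,
) -> Tensor:
    # fall back to identity values for anything not specified
    translate = [0, 0] if translate is None else translate
    shear = [0.0, 0.0] if shear is None else shear
    return F.affine(img, angle=angle, translate=translate, scale=scale, shear=shear)

img = torch.rand(3, 32, 32)
out = affine_with_defaults(img, shear=[10.0, 0.0])  # only shear needs to be given

The None sentinels avoid mutable list defaults while keeping the call sites short.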


vfdev-5 commented Nov 30, 2021

The rotate function seems unnecessary, since it's already covered by affine. It might be better to deprecate rotate to make the API more unified, but it's totally fine to keep the current structure.

@ain-soph Due to historical reasons, rotate was implemented first and the affine op later. The rotate API is not completely redundant compared to the affine op, as rotate has the flag expand, which is not present for affine.
I do not think that deprecating this op in favor of affine is a good idea right now. Replacing the implementation of rotate to use affine behind the scenes could be investigated, especially in terms of performance on PIL images and tensors (it may turn out that rotate is a bit faster than affine with angle only).

It would be helpful if the arguments of affine had default values (e.g., angle=0.0, translate=[0.0, 0.0], shear=[0.0, 0.0]), so that we only need to call affine(tensor, shear=shear) without setting the other arguments.

Yes, I think default values could be helpful in this case.


vfdev-5 commented Dec 7, 2021

I'm not sure whether an all-zero gradient is the expected behavior when mode="nearest". It seems correct to me that nearest sampling makes the gradient independent of grid.

Yes, it is expected (https://github.com/pytorch/pytorch/blob/ca945d989ac56e0070f82fbe657d078610043ea3/aten/src/ATen/native/cpu/GridSamplerKernel.cpp#L806). Also, take a look at section 3.3, "Differentiable Image Sampling", of this paper, where they derive the forward-pass formulas for the nearest and bilinear modes and show how to compute grid gradients in the bilinear case.

Maybe we should change the default interpolation mode to bilinear?

I think we should not change the default value just to cover the particular case of non-zero grid grads. For your use case you can specify the interpolation mode and thus obtain non-zero grads.

I prototyped a bit on making F.rotate differentiable, and it looks like it can be done rather easily. The JIT tests are currently not passing due to the additional type support, but IMO this can be solved: vfdev-5@1214e6f


vfdev-5 commented Dec 9, 2021

@ain-soph Let's do the following to avoid debugging on the CI: I'll finish up the code for rotate so that it passes JIT etc. in another PR, and you can then handle the other color-related transforms and affine. What do you think?


ain-soph commented Dec 9, 2021

@vfdev-5 Oops, sorry if my debugging caused any trouble...

I tried to debug locally, but it seems I have to do some extra work to get from the source code to the final binary version.

I'd appreciate it if you could help me with the JIT issues. I have no previous experience with JIT, and I'm still facing a ton of errors even after fixing a bunch of them.


ain-soph commented Dec 9, 2021

@vfdev-5 I've got my local environment ready for debugging, and it seems all tests pass except the new differentiability test that you added.

Here are the only 2 remaining issues:

  • F.rotate is differentiable, but the JIT version is not (the output does not require gradients).

    My personal guess is that the current type of angle declared in rotate is float, which makes the Tensor get converted to a float inside the JIT-scripted function. Therefore the gradient information is lost, and the output tensor has .requires_grad=False (see the repro sketch at the end of this comment).

    test_differentiable_rotate[None-fn1] - test.test_functional_tensor.TestRotate
    
    Traceback (most recent call last):
      File "/root/project/test/test_functional_tensor.py", line 164, in test_differentiable_rotate
        assert y.requires_grad
    AssertionError: assert False
     +  where False = tensor([[[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n          [0.,...0., 0., 0.],\n          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],\n          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]]]).requires_grad
    
  • The current type of center declared in rotate is List[float], but the test example passes a Tensor:
    test_differentiable_rotate[center1-fn1] - test.test_functional_tensor.TestRotate
    
    Traceback (most recent call last):
      File "/root/project/test/test_functional_tensor.py", line 163, in test_differentiable_rotate
        y = fn(x, alpha, interpolation=BILINEAR, center=center)
    RuntimeError: rotate() Expected a value of type 'Optional[List[int]]' for argument 'center' but instead found type 'Tensor'.
    Position: 4
    Value: tensor([0.1000, 0.2000], requires_grad=True)
    Declaration: rotate(Tensor img, float angle, Enum<__torch__.torchvision.transforms.functional.InterpolationMode> interpolation=Enum<InterpolationMode.NEAREST>, bool expand=False, int[]? center=None, float[]? fill=None, int? resample=None) -> (Tensor)
    Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)
    

I wonder if it's necessary to make the type of those arguments Union[XXX, Tensor], since union types seem to raise a lot of new issues in JIT.
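To illustrate the first point, here is a minimal repro sketch (scale_by is a hypothetical function, not torchvision code) of how a 0-dim Tensor passed for a float parameter gets implicitly converted by TorchScript, breaking the autograd link (exact behavior may vary across PyTorch versions):

import torch

@torch.jit.script
def scale_by(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

x = torch.rand(3)
factor = torch.tensor(2.0, requires_grad=True)
y = scale_by(x, factor)  # the 0-dim tensor is cast to a Python float
print(y.requires_grad)   # False: the graph no longer connects to factor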


ain-soph commented Dec 9, 2021

And here is another CI error that I can't reproduce in my local environment (these tests passed successfully on my local machine):

test_random_rotate[85-InterpolationMode.NEAREST-degrees3-True-center0-cuda] - test.test_transforms_tensor


Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 555, in test_random_rotate
    _test_transform_vs_scripted(transform, s_transform, tensor)
  File "C:\Users\circleci\project\test\test_transforms_tensor.py", line 29, in _test_transform_vs_scripted
    assert_equal(out1, out2, msg=msg)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1217, in assert_close
    assert_equal(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 997, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 3 / 9333 (0.0%)
Greatest absolute difference: 43 at index (0, 3, 2) (up to 1e-06 allowed)
Greatest relative difference: 0.5058823823928833 at index (0, 3, 2) (up to 0 allowed)

I'll try to add more test scenarios after solving the current issues.

@ain-soph

Closing in favor of #5110.
