[proto] Reduce number of calls of __torch_function__ #6681

Merged: 16 commits, Oct 17, 2022

Conversation

@vfdev-5 (Collaborator) commented Oct 3, 2022

An attempt to improve API v2 performance by reducing the number of _Feature.__torch_function__ calls.
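For context, a minimal sketch (not from this PR) of why this is costly: on a torch.Tensor subclass, operator calls and even metadata accesses such as .shape are routed through __torch_function__. The Counting subclass below is hypothetical, used only to count interpositions:

import torch

class Counting(torch.Tensor):
    calls = 0

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Count every interposition, then defer to the default handling.
        cls.calls += 1
        return super().__torch_function__(func, types, args, kwargs or {})

t = torch.rand(3, 16, 16).as_subclass(Counting)
_ = t.shape  # metadata access goes through __torch_function__ too
_ = t + 1    # so does every operator call
print(Counting.calls)  # 2 here; illustrative, may vary across torch versions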

cProfile of the classification AutoAugment (ImageNet) + RandomErasing pipeline (code):

# python -u main.py cprofile_pil_vs_feature --n=2000
Profile API v2 on Feature
Compose(
      ToImageTensor()
      RandomResizedCrop(size=(224, 224), scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=InterpolationMode.BILINEAR, antialias=True)
      RandomHorizontalFlip(p=1.0)
      AutoAugment(interpolation=InterpolationMode.BILINEAR, policy=AutoAugmentPolicy.IMAGENET)
      ConvertImageDtype()
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], inplace=False)
      RandomErasing(p=1.0, scale=(0.02, 0.33), ratio=(0.3, 3.3), value=0, inplace=False)
)

Main:

         1827041 function calls (1792534 primitive calls) in 6.310 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    1.450    0.001    1.450    0.001 {built-in method torch._C._nn._upsample_bilinear2d_aa}
15251/13251    0.518    0.000    0.565    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     2628    0.466    0.000    0.855    0.000 /vision/torchvision/transforms/functional_tensor.py:875(_scale_channel)
      309    0.233    0.001    0.233    0.001 {built-in method torch.grid_sampler}
    96931    0.208    0.000    0.997    0.000 /vision/torchvision/prototype/features/_feature.py:71(__torch_function__)
4000/2000    0.201    0.000    0.208    0.000 {method 'flip' of 'torch._C._TensorBase' objects}
     2628    0.175    0.000    0.175    0.000 {built-in method torch.bincount}
4062/4031    0.171    0.000    0.172    0.000 {method 'clone' of 'torch._C._TensorBase' objects}
...

This PR:

         1269905 function calls (1243829 primitive calls) in 5.986 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    1.454    0.001    1.454    0.001 {built-in method torch._C._nn._upsample_bilinear2d_aa}
15251/13251    0.527    0.000    0.551    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     2628    0.477    0.000    0.875    0.000 /vision/torchvision/transforms/functional_tensor.py:875(_scale_channel)
      309    0.231    0.001    0.231    0.001 {built-in method torch.grid_sampler}
4000/2000    0.203    0.000    0.216    0.000 {method 'flip' of 'torch._C._TensorBase' objects}
     2628    0.177    0.000    0.177    0.000 {built-in method torch.bincount}
4062/4031    0.171    0.000    0.171    0.000 {method 'clone' of 'torch._C._TensorBase' objects}
...
    18305    0.084    0.000    0.757    0.000 /vision/torchvision/prototype/features/_feature.py:74(__torch_function__)
     4000    0.073    0.000    0.754    0.000 /vision/torchvision/prototype/transforms/_transform.py:66(forward)
     2000    0.067    0.000    0.154    0.000 /vision/torchvision/transforms/functional_tensor.py:943(erase)
...

Time measurements on all classification data augmentation pipelines show improvements of ~1 ms.

cc @pmeier @datumbox

@vfdev-5 vfdev-5 marked this pull request as draft October 3, 2022 12:40
# this way we return the result without passing into __torch_function__
@property
def shape(self):
    return self.as_subclass(torch.Tensor).shape
Contributor:
Have we tried maintaining the original reference to the input tensor data prior to it getting wrapped in the subclassed tensor? @ezyang proposed this idea offline as a means to speed things up. He just cautioned to keep this private, because if a user gets hold of the original reference and does original.resize_(), the change will be visible only on that tensor and not on the wrapped one.
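For illustration, a minimal sketch of that idea; the _data_tensor attribute name is hypothetical and this is not the PR's final implementation:

import torch

class Image(torch.Tensor):
    @classmethod
    def _wrap(cls, tensor: torch.Tensor) -> "Image":
        wrapped = tensor.as_subclass(cls)
        # Private reference to the plain tensor, kept so that metadata
        # reads can bypass __torch_function__ entirely.
        wrapped._data_tensor = tensor
        return wrapped

    @property
    def shape(self):
        return self._data_tensor.shape  # no __torch_function__ hop

img = Image._wrap(torch.rand(3, 16, 16))
assert img.shape == (3, 16, 16)

The caveat above applies: original.resize_() mutates only the metadata of the tensor it is called on, so the wrapper and the private reference can silently diverge.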

Contributor:

Is this really faster than going into __torch_function__? You have to do a tensor allocation here!

@pmeier (Collaborator) left a comment:

IIUC, we are only using .as_subclass(torch.Tensor) here to avoid __torch_function__, correct? If yes, why not simply use the context manager that disables this?

with DisableTorchFunction():
    output = func(*args, **kwargs or dict())

like

with DisableTorchFunction():
    return self.shape

Better yet, instead of adding properties for all these attributes manually, why not overwrite __getattribute__ to do this for us?

    # (assumes `import contextlib` and `from torch._C import DisableTorchFunction` are in scope)
    _NO_TORCH_FUNCTION = {
        "shape",
    }

    def __getattribute__(self, item):
        with DisableTorchFunction() if item in type(self)._NO_TORCH_FUNCTION else contextlib.nullcontext():
            return object.__getattribute__(self, item)

Test:

import unittest.mock

import torch
from torchvision.prototype import features

image = features.Image(torch.rand(3, 16, 16))

with unittest.mock.patch(
    "torchvision.prototype.features._Feature.__torch_function__",
    side_effect=AssertionError,
):
    image.shape

I have not benchmarked this. If it is as fast or faster than your suggestion, I would prefer this for several reasons:

  1. We don't need to add properties manually, which would be harder to maintain.
  2. Subclasses can extend _NO_TORCH_FUNCTION or whatever we call it to also avoid __torch_function__ calls on the metadata they are adding, i.e. Image.color_space.
  3. Since we don't need to unwrap anymore, we can also avoid the issues @datumbox mentioned in [proto] Reduce number of calls of __torch_function__ #6681 (comment).

Comment on lines 100 to 101
# Question: Is it safe to assume data to be a tensor ?
out = data.as_subclass(Image)
Collaborator:

No. features.Label(0) is a good counter example.

vfdev-5 (Collaborator, author):

This code is inside Image; a Label won't pass through here, IMO.

@vfdev-5 (Collaborator, author) commented Oct 4, 2022

IIUC, we are only using .as_subclass(torch.Tensor) here to avoid __torch_function__, correct? If yes, why not simply use the context manager that disables this?

Yes, it is to disable torch function. I can benchmark it, but I assume that using the context manager will be slower, since it involves more function calls (e.g. the enter/exit of the context manager). Let's see.

@vfdev-5 (Collaborator, author) commented Oct 4, 2022

Here are some time measurements for different options:

import torch
from torch._C import DisableTorchFunction


class ATensor(torch.Tensor):
    
    def __new__(
        cls,
        data,
        *,
        dtype=None,
        device=None,
        requires_grad: bool = False,
    ):
        data = torch.as_tensor(  # type: ignore[return-value]
            data,
            dtype=dtype,  # type: ignore[arg-type]
            device=device,  # type: ignore[arg-type]
        )
        
        output = data.as_subclass(cls).requires_grad_(requires_grad)
        output._data_tensor = data
        return output
    
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        return super().__torch_function__(func, types, args, kwargs)
    
    @property
    def shape2(self):
        return self.as_subclass(torch.Tensor).shape

    @property
    def shape3(self):
        return self._data_tensor.shape
    
    @property
    def shape4(self):
        with DisableTorchFunction():
            return self.shape

a = ATensor(torch.tensor([0, 1, 2]))
%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape)  # original

462 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape2)  # self.as_subclass(torch.Tensor)

189 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape3)  # self._data_tensor

33.2 µs ± 552 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape4)  # DisableTorchFunction

54.3 µs ± 661 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The fastest option is using a reference to the original torch tensor; the second fastest is DisableTorchFunction.
Between these two options I think DisableTorchFunction is better, as there are no side effects with in-place ops on the underlying tensor.

@datumbox (Contributor) commented Oct 4, 2022

@vfdev-5 Sounds good. Let's proceed with DisableTorchFunction. IMO moving forward with properties is OK. It makes the code readable and simpler compared to using __getattribute__. I would assume it's also marginally faster, as it avoids additional ops. The cost of maintaining them is not that big, only a few lines of code placed in the _Feature class.

Looking forward to seeing what impact this has on the overall performance.

@vfdev-5 (Collaborator, author) commented Oct 5, 2022

Looking forward to seeing what impact this has on the overall performance.

The overall impact of these changes is ~0.1 ms vs main.

Using a ref is a bit faster than using DisableTorchFunction.

However, there are a few blockers/issues in the implementation (not yet pushed here)... WIP

@ezyang (Contributor) commented Oct 5, 2022

By the way, the core team is open to optimizing the relevant APIs to make them faster. We didn't really do any benchmarking on them. One easy win is to have subclasses preregister the set of functions they are interested in interposing on via __torch_function__, and fast-path the calls they are not.
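A rough user-level sketch of what that preregistration idea could look like (the real win would presumably come from doing this in core; _INTERPOSED_FUNCS is a hypothetical name):

import torch
from torch._C import DisableTorchFunction

class Feature(torch.Tensor):
    # Hypothetical preregistered set of funcs this subclass cares about.
    _INTERPOSED_FUNCS = {torch.Tensor.add_}

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func not in cls._INTERPOSED_FUNCS:
            # Fast path: no subclass-specific handling; run the op with
            # dispatch disabled (returns a plain tensor in this sketch).
            with DisableTorchFunction():
                return func(*args, **kwargs)
        # Slow path: subclass-specific handling would go here.
        return super().__torch_function__(func, types, args, kwargs)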

@vfdev-5 (Collaborator, author) commented Oct 5, 2022

One easy win is to have subclasses preregister the set of functions they are interested in interposing on via __torch_function__, and fast-path the calls they are not.

@ezyang yes, this could be helpful!

@datumbox (Contributor) commented Oct 6, 2022

@ezyang Thanks for offering to help. Absolutely, we would love to work with you to make the solution faster. Perhaps we can start with a quick review of our practices (I believe this PR is a good place to start), so you can let us know if we make improper use of some of the subclassing features and whether there are any low-hanging fruits we can improve. Happy to chat more about dedicating some time to improving the solution.

@ezyang (Contributor) commented Oct 6, 2022

The lines edited by this PR don't look obviously wrong, but I haven't looked at the whole thing.

@vfdev-5 are you open to implementing some of the core optimizations? We're pretty low-bandwidth on our side, but I can definitely find time for design and review.

@vfdev-5 (Collaborator, author) commented Oct 6, 2022

@ezyang we are very tight on schedule for v2 right now, but yes, I'm open to implementing some core optimizations a bit later, once we are done with the Python-side optimizations. Thanks for offering; let me ping you once we have time to put effort there.

@vfdev-5 (Collaborator, author) commented Oct 10, 2022

After merging #6718, the perf improvements are less significant in terms of time measurements (~0.01 ms); however, we can still reduce the number of __torch_function__ calls.

Main: 65932 0.156 0.000 0.888 0.000 /vision/torchvision/prototype/features/_feature.py:62(__torch_function__)
This PR: 16747 0.073 0.000 0.767 0.000 /vision/torchvision/prototype/features/_feature.py:67(__torch_function__)

@vfdev-5 vfdev-5 marked this pull request as ready for review October 10, 2022 12:51
@@ -16,7 +16,9 @@
 class EncodedData(_Feature):
     @classmethod
     def _wrap(cls: Type[D], tensor: torch.Tensor) -> D:
-        return tensor.as_subclass(cls)
+        output = tensor.as_subclass(cls)
+        output._tensor = tensor
@datumbox (Contributor) commented Oct 10, 2022:

Originally you had concerns about adopting this approach, in order to avoid issues with in-place operators. Assuming we keep this reference private, are we safe from in-place operators? I.e., do the shapes of img and img._tensor match after the following?

x = torch.randn((10,1,1,1))
img = Image(x)
img.resize_((10, 1, 10, 10))

vfdev-5 (Collaborator, author):

No, we are not safe with resize_-like ops, and the shapes won't match...
This is a drawback of using a tensor ref...

Contributor:

Then that's probably something we would like to avoid, right? Shall we proceed with your original proposal (#6681 (comment)) to use DisableTorchFunction? My understanding is that it would be safe for in-place ops with little performance penalty.

vfdev-5 (Collaborator, author):

Yes, it would be better to avoid that. I'll check DisableTorchFunction again, but given that the gain is very tiny and within benchmarking noise, maybe we can look for other optimizations for now.

Collaborator:

No, we are not safe with resize_-like ops, and the shapes won't match...
This is a drawback of using a tensor ref...

IIUC, this should not be that much of an issue. We are already special-casing in-place ops:

# Inplace `func`'s, canonically identified with a trailing underscore in their name like `.add_(...)`,
# will retain the input type. Thus, we need to unwrap here.
if isinstance(output, cls):
    return output.as_subclass(torch.Tensor)
return output

Can't we simply correct the _tensor reference there? Like

if isinstance(output, cls): 
    output._tensor = output = output.as_subclass(torch.Tensor)

If that is too much syntax sugar, the more verbose variant would be

if isinstance(output, cls):
    tensor = output.as_subclass(torch.Tensor)
    output._tensor = tensor
    return tensor

Contributor:

I'm open to it if we can guarantee it's safe. If we go ahead with keeping a reference, is it possible to do a bit of refactoring to use a double-underscore (name-mangled) attribute and make it much harder for users to shoot themselves in the foot? I suspect the issue here is subclassing of our _Feature, but I'm open to ideas.

@vfdev-5 WDYT of Philip's proposal?

vfdev-5 (Collaborator, author):

I'll check that later, but as commented previously, I don't see this update bringing much perf gain, just a very tiny speed-up. Let me come back to this PR later. I'll put it in draft for now.

@vfdev-5 vfdev-5 marked this pull request as draft October 12, 2022 10:11
Merge branch 'main' of github.com:pytorch/vision into proto-perf-feature-improvements
@vfdev-5 vfdev-5 force-pushed the proto-perf-feature-improvements branch from 30c1ae9 to 3799ce7 Compare October 14, 2022 11:11
@vfdev-5 (Collaborator, author) commented Oct 14, 2022

With DisableTorchFunction (3799ce7) we have:

- 58336    0.147    0.000    0.513    0.000 /vision/torchvision/prototype/features/_feature.py:62(__torch_function__)
+ 16747    0.070    0.000    0.376    0.000 /vision/torchvision/prototype/features/_feature.py:64(__torch_function__)

Sources: before, after

Time diffs for classification pipelines, one example:

[ Classification AA=ra RE=1.0 transforms measurements ]
                          |    v2 
1 threads: -----------------------
      Tensor Image data   |  3.803
-      Feature Image data  |  3.986
+      Feature Image data  |  3.931

Sources: before, after
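For reference, a condensed sketch of the DisableTorchFunction-based __torch_function__, pieced together from the snippets quoted earlier in this thread (metadata handling and edge cases omitted):

import torch
from torch._C import DisableTorchFunction

class _Feature(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Run the op with dispatch disabled so that calls made inside
        # `func` don't re-enter __torch_function__.
        with DisableTorchFunction():
            output = func(*args, **(kwargs or dict()))
        # Inplace `func`'s, canonically identified with a trailing
        # underscore like `.add_(...)`, retain the input type, so we
        # unwrap back to a plain tensor here.
        if isinstance(output, cls):
            return output.as_subclass(torch.Tensor)
        return output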

@vfdev-5 vfdev-5 force-pushed the proto-perf-feature-improvements branch from 8348759 to 99c3594 Compare October 14, 2022 16:00
@vfdev-5 (Collaborator, author) commented Oct 14, 2022

Results for the _tensor approach (38f8e21):

[ Classification AA=augmix RE=1.0 transforms measurements ]
                          |    v2 
1 threads: -----------------------
-      Tensor Image data   |  7.846
-      Feature Image data  |  7.916
+      Tensor Image data   |  7.798     <---- quicker than the same test above, i.e. within noise
+      Feature Image data  |  7.875

Source: DisableTorchFunction approach, _tensor approach

@vfdev-5 vfdev-5 marked this pull request as ready for review October 14, 2022 20:41
@pmeier (Collaborator) left a comment:

One nit inline. Otherwise LGTM, thanks Victor!

One question though, for my own understanding: we don't need this for the feature-specific attributes like Image.color_space since they are not part of __torch_function__ anyway, right?

@vfdev-5 (Collaborator, author) commented Oct 17, 2022

we don't need this for the feature-specific attributes like Image.color_space since they are not part of __torch_function__ anyway

Yes, it seems like additional attributes like color_space are not routed through __torch_function__, so there is no need to make them properties.
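A quick way to check this, along the lines of the mock test earlier in the thread (sketch; assumes the prototype API at the time):

import unittest.mock

import torch
from torchvision.prototype import features

image = features.Image(torch.rand(3, 16, 16))

with unittest.mock.patch(
    "torchvision.prototype.features._Feature.__torch_function__",
    side_effect=AssertionError,
):
    image.color_space  # plain Python attribute: no __torch_function__ dispatch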

@vfdev-5 vfdev-5 merged commit 149edda into pytorch:main Oct 17, 2022
@vfdev-5 vfdev-5 deleted the proto-perf-feature-improvements branch October 17, 2022 07:59
@github-actions commented:

Hey @vfdev-5!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@vfdev-5 added the labels Perf (for performance improvements), prototype, and module: transforms on Oct 17, 2022
facebook-github-bot pushed a commit that referenced this pull request Oct 17, 2022
Summary:
* [proto] Reduce number of calls of __torch_function__

* Use DisableTorchFunction and super

* Use self._tensor

* Fixes mypy and color space handling

* revert Image.new_like

* WIP

* Perf opt with ref to tensor and properties

* Removed requires_grad property

* Use _tensor ref

* Revert "Use _tensor ref"

This reverts commit 38f8e21.

* Update torchvision/prototype/features/_feature.py

Reviewed By: NicolasHug

Differential Revision: D40427451

fbshipit-source-id: f241b2d48823d3612943b79887da2c8d1f482160

Co-authored-by: Philip Meier <[email protected]>