[proto] Reduce number of calls of __torch_function__ #6681

Merged: 16 commits, Oct 17, 2022

Conversation

@vfdev-5 (Collaborator) commented Oct 3, 2022

An attempt to improve API v2 performance by reducing the number of _Feature.__torch_function__ calls.
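For context, a minimal sketch (not from this PR) of why this is costly: on a torch.Tensor subclass, operator calls and even metadata accesses such as .shape are routed through __torch_function__. The Counting subclass below is hypothetical, used only to count interpositions:

import torch

class Counting(torch.Tensor):
    calls = 0

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Count every interposition, then defer to the default handling.
        cls.calls += 1
        return super().__torch_function__(func, types, args, kwargs or {})

t = torch.rand(3, 16, 16).as_subclass(Counting)
_ = t.shape  # metadata access goes through __torch_function__ too
_ = t + 1    # so does every operator call
print(Counting.calls)  # 2 here; illustrative, may vary across torch versions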

cProfile of the classification AutoAugment (ImageNet) + RandomErasing pipeline (code):

# python -u main.py cprofile_pil_vs_feature --n=2000
Profile API v2 on Feature
Compose(
      ToImageTensor()
      RandomResizedCrop(size=(224, 224), scale=(0.08, 1.0), ratio=(0.75, 1.3333333333333333), interpolation=InterpolationMode.BILINEAR, antialias=True)
      RandomHorizontalFlip(p=1.0)
      AutoAugment(interpolation=InterpolationMode.BILINEAR, policy=AutoAugmentPolicy.IMAGENET)
      ConvertImageDtype()
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], inplace=False)
      RandomErasing(p=1.0, scale=(0.02, 0.33), ratio=(0.3, 3.3), value=0, inplace=False)
)

Main:

         1827041 function calls (1792534 primitive calls) in 6.310 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    1.450    0.001    1.450    0.001 {built-in method torch._C._nn._upsample_bilinear2d_aa}
15251/13251    0.518    0.000    0.565    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     2628    0.466    0.000    0.855    0.000 /vision/torchvision/transforms/functional_tensor.py:875(_scale_channel)
      309    0.233    0.001    0.233    0.001 {built-in method torch.grid_sampler}
    96931    0.208    0.000    0.997    0.000 /vision/torchvision/prototype/features/_feature.py:71(__torch_function__)
4000/2000    0.201    0.000    0.208    0.000 {method 'flip' of 'torch._C._TensorBase' objects}
     2628    0.175    0.000    0.175    0.000 {built-in method torch.bincount}
4062/4031    0.171    0.000    0.172    0.000 {method 'clone' of 'torch._C._TensorBase' objects}
...

This PR:

         1269905 function calls (1243829 primitive calls) in 5.986 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    1.454    0.001    1.454    0.001 {built-in method torch._C._nn._upsample_bilinear2d_aa}
15251/13251    0.527    0.000    0.551    0.000 {method 'to' of 'torch._C._TensorBase' objects}
     2628    0.477    0.000    0.875    0.000 /vision/torchvision/transforms/functional_tensor.py:875(_scale_channel)
      309    0.231    0.001    0.231    0.001 {built-in method torch.grid_sampler}
4000/2000    0.203    0.000    0.216    0.000 {method 'flip' of 'torch._C._TensorBase' objects}
     2628    0.177    0.000    0.177    0.000 {built-in method torch.bincount}
4062/4031    0.171    0.000    0.171    0.000 {method 'clone' of 'torch._C._TensorBase' objects}
...
    18305    0.084    0.000    0.757    0.000 /vision/torchvision/prototype/features/_feature.py:74(__torch_function__)
     4000    0.073    0.000    0.754    0.000 /vision/torchvision/prototype/transforms/_transform.py:66(forward)
     2000    0.067    0.000    0.154    0.000 /vision/torchvision/transforms/functional_tensor.py:943(erase)
...

Time measurements on all classification data augmentation pipelines show improvements of ~1 ms.

cc @pmeier @datumbox

@vfdev-5 vfdev-5 marked this pull request as draft October 3, 2022 12:40
# this way we return the result without passing into __torch_function__
@property
def shape(self):
    return self.as_subclass(torch.Tensor).shape
Contributor:
Have we tried maintaining the original reference to the input tensor data prior to it getting wrapped in the subclassed tensor? @ezyang proposed this idea offline as a means to speed things up. He just cautioned to keep this private, because if a user gets hold of the original reference and does original.resize_(), the change will be visible only on that tensor and not on the wrapped one.
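For illustration, a minimal sketch of that idea; the _data_tensor attribute name is hypothetical and this is not the PR's final implementation:

import torch

class Image(torch.Tensor):
    @classmethod
    def _wrap(cls, tensor: torch.Tensor) -> "Image":
        wrapped = tensor.as_subclass(cls)
        # Private reference to the plain tensor, kept so that metadata
        # reads can bypass __torch_function__ entirely.
        wrapped._data_tensor = tensor
        return wrapped

    @property
    def shape(self):
        return self._data_tensor.shape  # no __torch_function__ hop

img = Image._wrap(torch.rand(3, 16, 16))
assert img.shape == (3, 16, 16)

The caveat above applies: original.resize_() mutates only the metadata of the tensor it is called on, so the wrapper and the private reference can silently diverge.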

Contributor:

Is this really faster than going into __torch_function__? You have to do a tensor allocation here!

@pmeier (Collaborator) left a comment:

IIUC, we are only using .as_subclass(torch.Tensor) here to avoid __torch_function__, correct? If yes, why not simply use the context manager that disables this?

with DisableTorchFunction():
    output = func(*args, **kwargs or dict())

like

with DisableTorchFunction():
    return self.shape

Better yet, instead of adding properties for all these attributes manually, why not overwrite __getattribute__ to do this for us?

    # (assumes `import contextlib` and `from torch._C import DisableTorchFunction` are in scope)
    _NO_TORCH_FUNCTION = {
        "shape",
    }

    def __getattribute__(self, item):
        with DisableTorchFunction() if item in type(self)._NO_TORCH_FUNCTION else contextlib.nullcontext():
            return object.__getattribute__(self, item)

Test:

import unittest.mock

import torch
from torchvision.prototype import features

image = features.Image(torch.rand(3, 16, 16))

with unittest.mock.patch(
    "torchvision.prototype.features._Feature.__torch_function__",
    side_effect=AssertionError,
):
    image.shape

I have not benchmarked this. If it is as fast or faster than your suggestion, I would prefer this for several reasons:

  1. We don't need to add properties manually, which would be harder to maintain.
  2. Subclasses can extend _NO_TORCH_FUNCTION or whatever we call it to also avoid __torch_function__ calls on the metadata they are adding, i.e. Image.color_space.
  3. Since we don't need to unwrap anymore, we can also avoid the issues @datumbox mentioned in [proto] Reduce number of calls of __torch_function__ #6681 (comment).

Comment on lines 100 to 101
# Question: Is it safe to assume data to be a tensor ?
out = data.as_subclass(Image)
Collaborator:

No. features.Label(0) is a good counter example.

vfdev-5 (Collaborator, author):

This code is inside Image; a Label won't pass through here, IMO.

@vfdev-5 (Collaborator, author) commented Oct 4, 2022

IIUC, we are only using .as_subclass(torch.Tensor) here to avoid __torch_function__, correct? If yes, why not simply use the context manager that disables this?

Yes, it is to disable torch function. I can benchmark it, but I assume that using the context manager will be slower, since it involves more function calls (e.g. the enter/exit of the context manager). Let's see.

@vfdev-5 (Collaborator, author) commented Oct 4, 2022

Here are some time measurements for different options:

import torch
from torch._C import DisableTorchFunction


class ATensor(torch.Tensor):
    
    def __new__(
        cls,
        data,
        *,
        dtype=None,
        device=None,
        requires_grad: bool = False,
    ):
        data = torch.as_tensor(  # type: ignore[return-value]
            data,
            dtype=dtype,  # type: ignore[arg-type]
            device=device,  # type: ignore[arg-type]
        )
        
        output = data.as_subclass(cls).requires_grad_(requires_grad)
        output._data_tensor = data
        return output
    
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        return super().__torch_function__(func, types, args, kwargs)
    
    @property
    def shape2(self):
        return self.as_subclass(torch.Tensor).shape

    @property
    def shape3(self):
        return self._data_tensor.shape
    
    @property
    def shape4(self):
        with DisableTorchFunction():
            return self.shape

a = ATensor(torch.tensor([0, 1, 2]))
%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape)  # original

462 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape2)  # self.as_subclass(torch.Tensor)

189 µs ± 1.4 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape3)  # self._data_tensor

33.2 µs ± 552 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
s = 0
for _ in range(100):
    s += sum(a.shape4)  # DisableTorchFunction

54.3 µs ± 661 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The fastest option is using a reference to the original torch tensor; the second fastest is DisableTorchFunction.
Between these two options I think DisableTorchFunction is better, as there are no side effects with in-place ops on the underlying tensor.

@datumbox (Contributor) commented Oct 4, 2022

@vfdev-5 Sounds good. Let's proceed with DisableTorchFunction. IMO moving forward with properties is OK. It makes the code readable and simpler compared to using __getattribute__. I would assume it's also marginally faster, as it avoids additional ops. The cost of maintaining them is not that big, only a few lines of code placed in the _Feature class.

Looking forward to seeing what impact this has on the overall performance.

@vfdev-5 (Collaborator, author) commented Oct 5, 2022

Looking forward to seeing what impact this has on the overall performance.

The overall impact of these changes is ~0.1 ms vs main.

Using a ref is a bit faster than using DisableTorchFunction.

However, there are a few blockers/issues in the implementation (not yet pushed here)... WIP

@ezyang (Contributor) commented Oct 5, 2022

By the way, the core team is open to optimizing the relevant APIs to make them faster. We didn't really do any benchmarking on them. One easy win is to have subclasses preregister the set of functions they are interested in interposing on via __torch_function__, and fast-path the calls they are not.
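A rough user-level sketch of what that preregistration idea could look like (the real win would presumably come from doing this in core; _INTERPOSED_FUNCS is a hypothetical name):

import torch
from torch._C import DisableTorchFunction

class Feature(torch.Tensor):
    # Hypothetical preregistered set of funcs this subclass cares about.
    _INTERPOSED_FUNCS = {torch.Tensor.add_}

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func not in cls._INTERPOSED_FUNCS:
            # Fast path: no subclass-specific handling; run the op with
            # dispatch disabled (returns a plain tensor in this sketch).
            with DisableTorchFunction():
                return func(*args, **kwargs)
        # Slow path: subclass-specific handling would go here.
        return super().__torch_function__(func, types, args, kwargs)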

@vfdev-5 (Collaborator, author) commented Oct 5, 2022

One easy win is to have subclasses preregister the set of functions they are interested in interposing on via __torch_function__, and fast-path the calls they are not.

@ezyang yes, this could be helpful!

@datumbox (Contributor) commented Oct 6, 2022

@ezyang Thanks for offering to help. Absolutely, we would love to work with you to make the solution faster. Perhaps we can start with a quick review of our practices (I believe this PR is a good place to start), so you can let us know if we make improper use of some of the subclassing features and whether there are any low-hanging fruits we can improve. Happy to chat more about dedicating some time to improving the solution.

@ezyang (Contributor) commented Oct 6, 2022

The lines edited by this PR don't look obviously wrong, but I haven't looked at the whole thing.

@vfdev-5 are you open to implementing some of the core optimizations? We're pretty low-bandwidth on our side, but I can definitely find time for design and review.

@vfdev-5 (Collaborator, author) commented Oct 6, 2022

@ezyang we are very tight on schedule for v2 right now, but yes, I'm open to implementing some core optimizations a bit later, once we are done with the Python-side optimizations. Thanks for offering; let me ping you once we have time to put effort there.

@vfdev-5 (Collaborator, author) commented Oct 10, 2022

After merging #6718, the perf improvements are less significant in terms of time measurements (~0.01 ms); however, we can still reduce the number of __torch_function__ calls.

Main: 65932 0.156 0.000 0.888 0.000 /vision/torchvision/prototype/features/_feature.py:62(__torch_function__)
This PR: 16747 0.073 0.000 0.767 0.000 /vision/torchvision/prototype/features/_feature.py:67(__torch_function__)

@vfdev-5 vfdev-5 marked this pull request as ready for review October 10, 2022 12:51
@@ -16,7 +16,9 @@
 class EncodedData(_Feature):
     @classmethod
     def _wrap(cls: Type[D], tensor: torch.Tensor) -> D:
-        return tensor.as_subclass(cls)
+        output = tensor.as_subclass(cls)
+        output._tensor = tensor
@datumbox (Contributor) commented Oct 10, 2022:

Originally you had concerns about adopting this approach, in order to avoid issues with in-place operators. Assuming we keep this reference private, are we safe from in-place operators? I.e., do the shapes of img and img._tensor match after the following?

x = torch.randn((10,1,1,1))
img = Image(x)
img.resize_((10, 1, 10, 10))

vfdev-5 (Collaborator, author):

No, we are not safe with resize_-like ops, and the shapes won't match...
This is a drawback of using a tensor ref...

Contributor:

Then that's probably something we would like to avoid, right? Shall we proceed with your original proposal (#6681 (comment)) to use DisableTorchFunction? My understanding is that it would be safe for in-place ops with little performance penalty.

vfdev-5 (Collaborator, author):

Yes, it would be better to avoid that. I'll check DisableTorchFunction again, but given that the gain is very tiny and within benchmarking noise, maybe we can look for other optimizations for now.

Collaborator:

No, we are not safe with resize_-like ops, and the shapes won't match...
This is a drawback of using a tensor ref...

IIUC, this should not be that much of an issue. We are already special-casing in-place ops:

# Inplace `func`'s, canonically identified with a trailing underscore in their name like `.add_(...)`,
# will retain the input type. Thus, we need to unwrap here.
if isinstance(output, cls):
    return output.as_subclass(torch.Tensor)
return output

Can't we simply correct the _tensor reference there? Like

if isinstance(output, cls): 
    output._tensor = output = output.as_subclass(torch.Tensor)

If that is too much syntax sugar, the more verbose variant would be

if isinstance(output, cls):
    tensor = output.as_subclass(torch.Tensor)
    output._tensor = tensor
    return tensor

Contributor:

I'm open to it if we can guarantee it's safe. If we go ahead with keeping a reference, is it possible to do a bit of refactoring to use a double-underscore (name-mangled) attribute and make it much harder for users to shoot themselves in the foot? I suspect the issue here is subclassing of our _Feature, but I'm open to ideas.

@vfdev-5 WDYT of Philip's proposal?

vfdev-5 (Collaborator, author):

I'll check that later, but as commented previously, I don't see this update bringing much perf gain, just a very tiny speed-up. Let me come back to this PR later. I'll put it in draft for now.

@vfdev-5 vfdev-5 marked this pull request as draft October 12, 2022 10:11
Merge branch 'main' of github.com:pytorch/vision into proto-perf-feature-improvements
@vfdev-5 vfdev-5 force-pushed the proto-perf-feature-improvements branch from 30c1ae9 to 3799ce7 Compare October 14, 2022 11:11
@vfdev-5 (Collaborator, author) commented Oct 14, 2022

With DisableTorchFunction (3799ce7) we have:

- 58336    0.147    0.000    0.513    0.000 /vision/torchvision/prototype/features/_feature.py:62(__torch_function__)
+ 16747    0.070    0.000    0.376    0.000 /vision/torchvision/prototype/features/_feature.py:64(__torch_function__)

Sources: before, after

Time diffs for classification pipelines, one example:

[ Classification AA=ra RE=1.0 transforms measurements ]
                          |    v2 
1 threads: -----------------------
      Tensor Image data   |  3.803
-      Feature Image data  |  3.986
+      Feature Image data  |  3.931

Sources: before, after
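For reference, a condensed sketch of the DisableTorchFunction-based __torch_function__, pieced together from the snippets quoted earlier in this thread (metadata handling and edge cases omitted):

import torch
from torch._C import DisableTorchFunction

class _Feature(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Run the op with dispatch disabled so that calls made inside
        # `func` don't re-enter __torch_function__.
        with DisableTorchFunction():
            output = func(*args, **(kwargs or dict()))
        # Inplace `func`'s, canonically identified with a trailing
        # underscore like `.add_(...)`, retain the input type, so we
        # unwrap back to a plain tensor here.
        if isinstance(output, cls):
            return output.as_subclass(torch.Tensor)
        return output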

@vfdev-5 vfdev-5 force-pushed the proto-perf-feature-improvements branch from 8348759 to 99c3594 Compare October 14, 2022 16:00
@vfdev-5 (Collaborator, author) commented Oct 14, 2022

Results for the _tensor approach (38f8e21):

[ Classification AA=augmix RE=1.0 transforms measurements ]
                          |    v2 
1 threads: -----------------------
-      Tensor Image data   |  7.846
-      Feature Image data  |  7.916
+      Tensor Image data   |  7.798     <---- quicker than the same test above, i.e. within noise
+      Feature Image data  |  7.875

Source: DisableTorchFunction approach, _tensor approach

@vfdev-5 vfdev-5 marked this pull request as ready for review October 14, 2022 20:41
@pmeier (Collaborator) left a comment:

One nit inline. Otherwise LGTM, thanks Victor!

One question though, for my own understanding: we don't need this for the feature-specific attributes like Image.color_space since they are not part of __torch_function__ anyway, right?

@vfdev-5 (Collaborator, author) commented Oct 17, 2022

we don't need this for the feature-specific attributes like Image.color_space since they are not part of __torch_function__ anyway

Yes, it seems like additional attributes like color_space are not routed through __torch_function__, so there is no need to make them properties.
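A quick way to check this, along the lines of the mock test earlier in the thread (sketch; assumes the prototype API at the time):

import unittest.mock

import torch
from torchvision.prototype import features

image = features.Image(torch.rand(3, 16, 16))

with unittest.mock.patch(
    "torchvision.prototype.features._Feature.__torch_function__",
    side_effect=AssertionError,
):
    image.color_space  # plain Python attribute: no __torch_function__ dispatch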

@vfdev-5 vfdev-5 merged commit 149edda into pytorch:main Oct 17, 2022
@vfdev-5 vfdev-5 deleted the proto-perf-feature-improvements branch October 17, 2022 07:59
@github-actions commented:

Hey @vfdev-5!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@vfdev-5 added the labels Perf (for performance improvements), prototype, and module: transforms on Oct 17, 2022
facebook-github-bot pushed a commit that referenced this pull request Oct 17, 2022
Summary:
* [proto] Reduce number of calls of __torch_function__

* Use DisableTorchFunction and super

* Use self._tensor

* Fixes mypy and color space handling

* revert Image.new_like

* WIP

* Perf opt with ref to tensor and properties

* Removed requires_grad property

* Use _tensor ref

* Revert "Use _tensor ref"

This reverts commit 38f8e21.

* Update torchvision/prototype/features/_feature.py

Reviewed By: NicolasHug

Differential Revision: D40427451

fbshipit-source-id: f241b2d48823d3612943b79887da2c8d1f482160

Co-authored-by: Philip Meier <[email protected]>