API usage logging within TorchVision #5052


Closed
kazhang opened this issue Dec 8, 2021 · 19 comments · Fixed by #5038, #5007 or #5072

@kazhang
Contributor

kazhang commented Dec 8, 2021

Goal

To understand TorchVision usage within an organization (e.g. Meta).

The events give insights into TorchVision usage with regard to individual call sites, workflows, etc. The organization could also learn which APIs are trending, which could guide component development and deprecation.

Policy

  • Usage should be recorded only once for the same API within a process;
  • We should record events as broadly as possible; duplicate events (e.g. a module and a function logging the same thing) are OK and can be deduplicated in downstream pipelines.
  • For modules, API usage should be recorded at the beginning of the constructor of the main class, for example the __init__ of RegNet, but not in those of submodules (e.g. ResBottleneckBlock).
  • For functions, API usage should be recorded at the beginning of the function;
  • For torchvision.io, the logging must be added on both the Python and the C++ (using the csrc submodule as mentioned) side.
  • For torchvision.ops, the calls should be added both on the main class of the operator (e.g. StochasticDepth) and on its functional equivalent (e.g. stochastic_depth) if available.
  • For torchvision.transforms, the calls should be placed in the constructors of the Transform classes, the Auto-Augment classes and the functional methods.
  • For torchvision.datasets, the calls are placed once in the constructor of VisionDataset so we don't need to add them individually on each dataset.
  • For torchvision.utils, a call should be added at the top of each public method.

Event Format

The fully qualified name of the component is logged as the event, for example torchvision.models.resnet.ResNet.
Note: for events from C++ APIs, “.csrc” should be added after torchvision, for example torchvision.csrc.ops.nms.nms

Usage Log API

  • C++: C10_LOG_API_USAGE_ONCE()
  • Python:
from ..utils import _log_api_usage_once
# for class
_log_api_usage_once(self)
# for method
if not torch.jit.is_scripting() and not torch.jit.is_tracing():
  _log_api_usage_once(nms) 

The APIs above are lightweight; by default they are no-ops. It’s guaranteed that the same event is only recorded once within a process. Please note that 8 GPUs will still lead to 8 events, so the events should be deduplicated by a unique identifier such as a workflow job_id.

Implementation

from types import FunctionType
from typing import Any

import torch


def _log_api_usage_once(obj: Any) -> None:
    if not obj.__module__.startswith("torchvision"):
        return
    name = obj.__class__.__name__
    if isinstance(obj, FunctionType):
        name = obj.__name__
    torch._C._log_api_usage_once(f"{obj.__module__}.{name}")

Also considered

  • log usage in base class
    • Create a base class for all models, datasets, transforms and log usage in the init of base class
    • Introducing extra abstraction only for logging seems like overkill. In Base class for models #4569, we couldn't find any other features to add to a model base class; in addition, we would also need a way to log non-class usage;
  • use decorator
  • use function’s __module__:
    • For example: _log_api_usage(nms.__module__, "nms")
    • doesn’t work with TorchScript: attribute lookup is not defined on function
  • use global constant for module
    • For example: _log_api_usage(MODULE, "nms")
    • doesn’t work with TorchScript;
  • use flat namespace
    • For example: log events as "torchvision.{models|transforms|datasets}.{class or function name}"
    • there might be name collisions;
  • use object or function as param in logging API
    • For example: _log_api_usage_once(self)
    • doesn’t work with TorchScript;
  • log fully qualified name with __qualname__ for class, string for function
@kazhang kazhang mentioned this issue Dec 8, 2021
@kazhang kazhang self-assigned this Dec 8, 2021
@datumbox
Contributor

datumbox commented Dec 8, 2021

Thanks for the write up @kazhang.

I agree with the direction of the proposal. Personally I would favour torchvision.{ models | transforms | ops | io}.{function | class} but no strong opinions.

As discussed at #5038 (comment), adopting a straightforward, easy to implement and review solution will help us ensure that logging will be added on new endpoints and models. Already we've missed it in a couple of places (see #5044, admittedly I was involved on the reviews and missed this).

It might be good to include a few clarifications on how the proposal is implemented in practice, so that open-source contributors have a clearer understanding of the policy. For example:

  1. On torchvision.models, the logging call should be added on the constructor of the main class (eg RegNet) not on the ones of submodules (eg ResBottleneckBlock).
  2. On torchvision.io, the logging must be added both on the Python and the C++ (using the csrc submodule as mentioned) side.
  3. On torchvision.ops, the calls should be added both on the main class of the operator (eg StochasticDepth) and on its functional equivalent (eg stochastic_depth) if available. Note that this is not the situation as of now and this needs to be updated on the ops submodule.
  4. On torchvision.transforms, the calls should be placed on the constructors of the Transform classes, the Auto-Augment classes and the functional methods.
  5. On torchvision.datasets, the calls are placed once on the constructor of VisionDataset so we don't need to add them individually on each dataset.
  6. On torchvision.utils, I assume we should put calls at the top of each public method.

Concerning the outstanding work, my understanding is that:

  • torchvision.models is done.
  • torchvision.datasets is done.
  • torchvision.io has an open PR at Log io usage #5038 which requires minor changes but is mostly done
  • torchvision.ops needs to be updated to include calls in all class constructors. I believe that the calls on all methods already exist.
  • torchvision.transforms has no logging
  • torchvision.utils has no logging

@kazhang would you be OK to open a ticket for ops, transforms and utils and provide a sample implementation for each on the issue? The open-source community can definitely help us tackle this one. :)

@kazhang
Contributor Author

kazhang commented Dec 8, 2021

In terms of format, I personally prefer torchvision.{ models | transforms | ops | io | util}.{function | class} too; we'd need to update the format string here though. In this case, I recommend changing the logging API to

def _log_api_usage_once(module: str, name: str) -> None:
    if torch.jit.is_scripting() or torch.jit.is_tracing():
        return
    torch._C._log_api_usage_once(f"torchvision.{module}.{name}")

This way the API is consistent across modules and functions (as of today, modules use self while functions use strings).

@datumbox
Contributor

datumbox commented Dec 10, 2021

@kazhang After reviewing some of the PRs, I noticed that:

  • On C++ (see pr Log io usage #5038), torchvision/csrc/io/image/cpu/decode_jpeg.cpp records torchvision.csrc.io.decode_jpeg to the logger. Strictly speaking, this method doesn't really exist and might conflict with other methods within the io namespace (perhaps the gpu equivalents, if in the future we manage to use the dispatcher).
  • On Python (see pr Add api usage log to transforms #5007) the torchvision.transforms.functional.get_image_size records to the logger torchvision.transforms.get_image_size. Again this doesn't really exist because the methods under functional are not imported directly to the transforms namespace.
  • On Python (see pr revamp log api usage method #5072), the torchvision.models.detection.RetinaNet is recorded as torchvision.models.RetinaNet which again doesn't exist.
  • On Python (see pr Add logging to torchvision ops #4799), the torchvision.ops.boxes.batched_nms method was recorded as torchvision.ops.batched_nms. Though the method actually exists in that location due to being imported into the ops namespace, it's not the fully qualified name.

First of all, we should note that some of the inconsistency preexisted your PRs/proposal. Moreover, your proposal to log each method under the flat namespace does simplify things and at first glance looks very appealing. It also avoids recording very long fully qualified names such as torchvision.csrc.io.image.cpu.decode_jpeg.decode_jpeg. Unfortunately, it also opens up the possibility of name collisions and potentially misleading method locations. What are your thoughts on this?

cc @NicolasHug

@NicolasHug
Member

NicolasHug commented Dec 10, 2021

Regarding Python, I think we could try to record the upmost subpackage (with an __init__.py file) where the function is available. This reflects how we expect our users to import these functions/methods. By "upmost" I mean the one that's closest to the torchvision. namespace

For example we explicitly expose batched_nms in torchvision.ops.__init__.py, so it makes sense to log torchvision.ops.batched_nms.

Similarly, since RetinaNet is not available in torchvision.models.__init__.py, I would agree that torchvision.models.detection.RetinaNet makes more sense.
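This "upmost public name" lookup could even be automated. Below is a hedged sketch: upmost_public_name is a hypothetical helper (not torchvision code), demonstrated on the stdlib json package rather than torchvision so it runs standalone:

```python
import importlib

def upmost_public_name(obj) -> str:
    """Return the shortest `package.name` path under which `obj` is importable."""
    module, name = obj.__module__, obj.__name__
    parts = module.split(".")
    for i in range(1, len(parts) + 1):
        prefix = ".".join(parts[:i])
        # The object counts as "available" at a prefix only if the attribute
        # there is the very same object (i.e. it was re-exported).
        if getattr(importlib.import_module(prefix), name, None) is obj:
            return f"{prefix}.{name}"
    return f"{module}.{name}"  # fall back to the fully qualified name

from json.decoder import JSONDecoder
print(upmost_public_name(JSONDecoder))  # "json.JSONDecoder": re-exported at the top level
```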

@datumbox
Contributor

For Python, recording the most compact valid name path of the method/class is reasonable. If that were the case, would you make _log_api_usage_once receive 2 params as it does now, or a single string as before? Both are possible, aka:

_log_api_usage_once("models.detection", "RetinaNet")
# VS
_log_api_usage_once("torchvision.models.detection.RetinaNet")

Also, how about the C++ side? Using the fully qualified name such as torchvision.csrc.io.image.cpu.decode_jpeg.decode_jpeg sounds like overkill. An alternative might be to use a path similar to how the methods are registered (see for example decode_jpeg).

It's probably worth clarifying these details fully prior to making more changes to the code, to minimize throwaway work. Let me know your thoughts, thanks!

@NicolasHug
Member

NicolasHug commented Dec 10, 2021

use a path similar to how the methods are registered (see for example decode_jpeg).

Agree with this if we're sure to only ever log functions that we register through the dispatcher. Otherwise, we still need a convention for logging the non-registered functions, in which case I would go with this:

Using the fully qualified name such as torchvision.csrc.io.image.cpu.decode_jpeg.decode_jpeg sounds like overkill

It's a bit cumbersome I agree. But it also seems to be the most straightforward, dumb simple, and name-clash free strategy.

We can do both BTW: use the first strategy for registered functions, and use the other one as a fall-back

@kazhang
Contributor Author

kazhang commented Dec 10, 2021

@NicolasHug @datumbox
Yes, I agree that the drawback of using a flat namespace is name collisions, despite its simplicity.
With regard to the Python API, one consideration is model inheritance for modules; we can't simply use

_log_api_usage_once("models", "RetinaNet")

Because a model inheriting from this class won't be tracked properly; instead, we should use

_log_api_usage_once("models", self.__class__.__name__)

Now, if we want to record the upmost subpackage, we can do

_log_api_usage_once("models.detection", self.__class__.__name__)

But if there is another module inheriting from RetinaNet that doesn't live under models.detection (how likely is this?), the record for that module could be incorrect.
--EDIT--
Another disadvantage is that models which inherit from TorchVision classes but don't live in TorchVision will also be incorrectly logged as TorchVision events, since _log_api_usage_once adds the torchvision. prefix to "models.detection"

@kazhang
Contributor Author

kazhang commented Dec 10, 2021

I want to propose this:
Let's use fully qualified names, such as torchvision.csrc.io.image.cpu.decode_jpeg.decode_jpeg.
For Python modules, we simply call:

_log_api_usage_once(self.__module__, self.__class__.__name__)

For Python methods, we call:

_log_api_usage_once("torchvision.ops.boxes", "nms")

Pros:

  • a simple policy that everyone can easily follow
  • no naming collisions
  • events are correctly attributed to the corresponding top-level module (e.g. "torchvision")

Cons:

  • the names are long to write out

@NicolasHug @datumbox thoughts?

@datumbox
Contributor

Thanks to both of you for your input.

As @kazhang mentioned, some models (like the quantized ones) use inheritance, so hardcoding the module name or the class name won't work. So I agree the following should work for classes:

_log_api_usage_once(self.__module__, self.__class__.__name__)

For methods, we can use the same approach but instead of self, use the actual name of the method. For example for nms this would be:

_log_api_usage_once(nms.__module__, nms.__name__)

As far as I can tell, this works, the tests pass and JIT doesn't complain, but you should verify prior to replacing everything.

For C++, I'm OK using the fully qualified name: torchvision.csrc.io.image.cpu.decode_jpeg.decode_jpeg

What I like about this proposal, is that it's easy to write and easy to verify. Kai's list of pros checks all of the requirements that we have, so I think this should be OK.

@NicolasHug
Member

I'm not sure I fully understand the inheritance problem. IIUC, if B inherits from A and we log in both __init__() methods, the logger will be called twice when B is instantiated?

But then logging self.__class__.__name__ vs "RetinaNet" is just a matter of which class will be recorded twice instead of once, right? Does that mean we should divide all logging metrics by 2 when we look at B?

@datumbox
Contributor

@NicolasHug I don't think there is an inheritance problem. I think that Kai just wanted to warn against hardcoding class names in favour of using self.__class__.__name__. I agree with this statement. I think what triggered his response was my earlier example where I quoted full strings, but this was for illustration purposes only.

The corner case that Kai is talking about relates to how the Quantized models are structured; in a nutshell, the Quantized models inherit from the non-quantized ones. We could put the logger in both constructors if we think that simplifies our policy, but as long as we don't hardcode strings, the metrics will be fine without dividing by 2. Based on the documentation, "the callback fires only once for a given process for each of the APIs". Plus the calls should be further deduplicated later in the stats pipelines, so I think we should be OK.

@kazhang
Contributor Author

kazhang commented Dec 13, 2021

@NicolasHug @datumbox

I think what triggered his response was my earlier example where I quoted full strings, but this was for illustration purposes only.

ah, thanks for the clarification! My apologies for the confusion.

If everyone agrees, let's move ahead with logging the fully qualified name:

  • C++
// use decode_jpeg as an example
C10_LOG_API_USAGE_ONCE("torchvision.csrc.io.image.cpu.decode_jpeg.decode_jpeg")
  • Python Class
    • call at the beginning of the constructor (after super().__init__ if applicable);
    • even if the class inherits from another tracked class, we still want to add this call for simplicity;
_log_api_usage_once(self.__module__, self.__class__.__name__)
  • Python Method
# use nms as an example
_log_api_usage_once(nms.__module__, nms.__name__)

@kazhang
Contributor Author

kazhang commented Dec 13, 2021

I also want to clarify another point: whether we should track modules that inherit from torchvision but don't live in torchvision. For example, Classy Vision could inherit from torchvision.models.RegNet; with our current implementation, this will trigger an event like "classy_vision.models.regnet.RegNet".

I personally prefer not to log that kind of event because we can't really make use of it: our pipeline currently only looks at events starting with "torchvision". Therefore I recommend adding an explicit filter inside _log_api_usage_once to ignore events that don't start with torchvision

@kazhang
Contributor Author

kazhang commented Dec 14, 2021

For completeness, I also considered a simplified logging API:

def _log_api_usage_once(obj: Callable) -> None:
    if torch.jit.is_scripting() or torch.jit.is_tracing():
        return
    name = obj.__class__.__name__
    if name == "function":
        name = obj.__name__
    torch._C._log_api_usage_once(f"{obj.__module__}.{name}")

The API call would be as simple as

_log_api_usage_once(self)  # for class
_log_api_usage_once(nms) # for methods
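The name == "function" trick above works because plain Python functions are instances of the builtin function type, so their class name is literally "function":

```python
def nms():
    pass

class RetinaNet:
    pass

# For a function, __class__ is the builtin function type; for an instance,
# it is the user-defined class.
print(nms.__class__.__name__)          # "function"
print(RetinaNet().__class__.__name__)  # "RetinaNet"
```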

However, TorchScript doesn't like the idea

function cannot be used as a value:

see #5095

@kazhang
Contributor Author

kazhang commented Dec 14, 2021

In 3c97c20e3bfa34, I found out that TorchScript doesn't allow accessing the __module__ of a function :(
We have to write the fully qualified name for functions, e.g.

_log_api_usage_once("torchvision.ops.boxes", "batched_nms")

In this case, I would rather we use a single string as the param for _log_api_usage_once:

_log_api_usage_once("torchvision.ops.boxes.batched_nms")

For classes, it could be

_log_api_usage_once(self.__class__.__qualname__)

@datumbox
Contributor

datumbox commented Dec 14, 2021

It would be very nice to be able to have a simple API like the one you proposed below:

_log_api_usage_once(self)  # for class
_log_api_usage_once(nms) # for methods

I modified slightly your code at #5095:

def _log_api_usage_once(obj):
    name = obj.__class__.__name__
    if name == "function":
        name = obj.__name__
    torch._C._log_api_usage_once(f"{obj.__module__}.{name}")



def box_area(boxes: Tensor) -> Tensor:
    if not torch.jit.is_scripting() and not torch.jit.is_tracing():
        _log_api_usage_once(box_area)
    # ...

JIT tests seem to pass.

@kazhang
Contributor Author

kazhang commented Dec 14, 2021

Thanks @datumbox. Basically, we would need to check whether we are torchscripting for every function that needs to be compliant with TorchScript. Maybe it's better than writing the fully qualified name as in #5096.
If everyone agrees, I will modify and reopen #5095.
So it will be
For classes

_log_api_usage_once(self)

For methods

_log_api_usage_once(make_grid) # for methods

For methods that need to be compliant with TorchScript:
(e.g. all functional transforms)

if not torch.jit.is_scripting() and not torch.jit.is_tracing():
    _log_api_usage_once(to_tensor)

@NicolasHug ?

@kazhang
Contributor Author

kazhang commented Dec 20, 2021

#5007 and #5038 have been updated to reflect the changes in API and policy.
According to the new policy, we're supposed to add logging to quantization models and segmentation models even though they're already tracked via inheritance. I would like to get some help from the community.

@datumbox
Contributor

datumbox commented Dec 21, 2021

@kazhang Thanks for the changes. Given that in #5007 we didn't add calls for transforms such as RandomSizedCrop that use inheritance, I would be OK with omitting the extra calls on Quantization and Segmentation. If we do add them, we need them on the transforms that inherit, too. No strong opinions, but I'm leaning towards omitting them; thoughts?
