Missing fixed_size value in GeneralizedRCNNTransform breaks Faster-RCNN torchscript loading with C++ in train mode #4366


Closed
beniz opened this issue Sep 5, 2021 · 10 comments

Comments

@beniz

beniz commented Sep 5, 2021

🐛 Describe the bug

The fixed_size value in the GeneralizedRCNNTransform instantiation for faster_rcnn defaults to None, which breaks torchscript inference in C++.

See

transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std)
and compare to
self.transform = GeneralizedRCNNTransform(min(size), max(size), image_mean, image_std,
where fixed_size is explicitly set.

Thus with faster_rcnn, fixed_size defaults to None and loading from C++ yields:

Dynamic exception type: torch::jit::ErrorReport
std::exception::what: 
Unknown type name 'NoneType':
Serialized   File "code/__torch__/torchvision/models/detection/transform.py", line 11
  image_std : List[float]
  size_divisible : int
  fixed_size : NoneType
               ~~~~~~~~ <--- HERE
  def forward(self: __torch__.torchvision.models.detection.transform.GeneralizedRCNNTransform,
    images: List[Tensor],

To reproduce, we export the model with torch.jit.script for fasterrcnn_resnet50_fpn and we load from C++ with torch::jit::load().

Actually the exact export Python code we use is here: https://github.com/jolibrain/deepdetect/blob/master/tools/torch/trace_torchvision.py and we run:

python3 trace_torchvision.py fasterrcnn_resnet50_fpn --num_classes 2
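For context, the mechanism behind the NoneType error can be illustrated with a small standalone module (a sketch, not torchvision code): TorchScript infers an attribute's type from its value at script time, so an attribute that is plainly None serializes as a bare NoneType unless the class carries an explicit Optional annotation.

```python
from typing import Optional, Tuple

import torch
from torch import nn


class TinyTransform(nn.Module):
    # Explicit class-level annotation keeps the scripted attribute typed as
    # Optional[Tuple[int, int]] rather than the bare NoneType seen in the error.
    fixed_size: Optional[Tuple[int, int]]

    def __init__(self, fixed_size: Optional[Tuple[int, int]] = None):
        super().__init__()
        self.fixed_size = fixed_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Branch on the optional attribute, as GeneralizedRCNNTransform does.
        if self.fixed_size is None:
            return x + 1
        return x


scripted = torch.jit.script(TinyTransform())
out = scripted(torch.zeros(1))
```

With the annotation in place, the scripted module carries an Optional type in its serialized signature instead of NoneType.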

Versions

PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.21.1
Libc version: glibc-2.25

Python version: 3.6.9 (default, Jan 26 2021, 15:33:00)  [GCC 8.4.0] (64-bit runtime)
Python platform: Linux-4.15.0-151-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: 
GPU 0: NVIDIA RTX A5000
GPU 1: NVIDIA TITAN X (Pascal)
GPU 2: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.2
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.18.1
[pip3] torch==1.9.0+cu111
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchvision==0.10.0+cu111
[pip3] torchviz==0.0.1
[conda] Could not collect

cc @datumbox
@datumbox
Contributor

datumbox commented Sep 5, 2021

@beniz Thanks for reporting.

I noticed that your script sets the model in training mode before exporting. Is that intentional? Have you tried without it?

Edit: I confirm that the test_frcnn_tracing.cpp test will fail if the model is set in training mode. See #4367

+ ./test_frcnn_tracing
Loading model
Model loaded
terminate called after throwing an instance of 'torch::jit::JITException'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchvision/models/detection/faster_rcnn.py", line 25, in forward
      _5 = False
    if _5:
      ops.prim.RaiseException(_0)
      ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    else:
      pass

Traceback of TorchScript, original code (most recent call last):
  File "/root/project/torchvision/models/detection/generalized_rcnn.py", line 57, in forward
        """
        if self.training and targets is None:
            raise ValueError("In training mode, targets should be passed")
            ~~~~~~~~~~~~~~~~~~ <--- HERE
        if self.training:
            assert targets is not None
RuntimeError: In training mode, targets should be passed

packaging/build_cmake.sh: line 96:  2725 Aborted                 (core dumped) ./test_frcnn_tracing

Frankly I'm not sure exporting it in training mode will work. We could fix the specific issue, but I believe it's likely to fail elsewhere.

@fmassa any thoughts on whether we ever intended to support this?

@beniz
Author

beniz commented Sep 5, 2021

Thanks @datumbox for the super quick answer, and on a Sunday! So yes, good catch, I should have mentioned it: the training mode is intentional, since we're training from C++.
I'd understand if this falls outside the scope of provided support, since training from C++ may not be so common.

I should mention it's possible to work around the issue by setting the transform function by hand:

model.transform = M.detection.faster_rcnn.GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std, fixed_size=(min_size,min_size))

But as you correctly anticipated, the export in training mode fails a bit further down the road.

Also, note that our script exports fine in training mode when we specify the backbone with:

python3 trace_torchvision.py fasterrcnn --backbone resnet50 --num_classes 2

So I was thinking that there may be a generic path to export for training, starting with the fixed_size issue.

My take at this stage:

  • it'd still be useful to have fixed_size set to something other than None
  • it'd be useful to have GeneralizedRCNNTransform exportable for training

I've started rewriting a simplified GeneralizedRCNNTransform to accommodate the forward() and postprocess() functions; I can help if pointed in the right direction!

@datumbox
Contributor

datumbox commented Sep 6, 2021

@beniz Because I don't have your setup, it's hard for me to reproduce it. Could you please let me know what happens if you apply the patch at #4369 to your TorchVision code? Is the fixed_size error resolved?

Note that unfortunately changing the value to fixed_size=(min_size,min_size) gives the model a completely different behaviour: instead of resizing the images according to the min/max directives, it fixes them to the given size. So it's not really recommended...
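The behavioural difference can be sketched in plain Python (illustrative numbers; this mirrors the resize policy conceptually and is not the torchvision implementation):

```python
def minmax_scale(h: int, w: int, min_size: int, max_size: int):
    # Scale so the short side reaches min_size, capped so the long side
    # does not exceed max_size -- the default GeneralizedRCNNTransform policy.
    scale = min_size / min(h, w)
    if max(h, w) * scale > max_size:
        scale = max_size / max(h, w)
    return round(h * scale), round(w * scale)


def fixed_scale(h: int, w: int, fixed):
    # fixed_size simply forces the output shape, ignoring the aspect-ratio caps.
    return fixed


print(minmax_scale(480, 1280, 800, 1333))   # (500, 1333): long side caps the scale
print(fixed_scale(480, 1280, (800, 800)))   # (800, 800): aspect ratio is distorted
```

This is why forcing fixed_size as a workaround silently changes what the model trains on.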

@beniz
Author

beniz commented Sep 6, 2021

Thanks for the patch @datumbox. It does not resolve the fixed_size error, and my understanding is that https://github.com/pytorch/vision/pull/4369/files#diff-915d3fa993285c902d3bb51970cd119a3a24cfcf6a587aaa5f227bd1213a3beeR76 still sets the value to None. So typing with Optional does not appear to be enough, as inference (in train mode) still raises on NoneType.

No worries about the fixed_size value being set in practice; an empty list could do the trick.

FYI, when we work around the fixed_size issue by forcing a value, the next error is:

Dynamic exception type: torch::jit::ErrorReport
std::exception::what: 
Unknown type name 'NoneType':
Serialized   File "code/__torch__/torchvision/models/detection/image_list.py", line 6
  def __init__(self: __torch__.torchvision.models.detection.image_list.ImageList,
    tensors: Tensor,
    image_sizes: List[Tuple[int, int]]) -> NoneType:
                                           ~~~~~~~~ <--- HERE

More generally, I guess this boils down to finding a way to handle None values at jit::load time, while at inference these values are actually set?

@datumbox
Contributor

datumbox commented Sep 6, 2021

@beniz Could you provide the exact error you get after applying the patch at #4369? My understanding is that the optional and the type definition should make it work.

Concerning the None at torchvision.models.detection.image_list.ImageList: yes, I think multiple other parts will fail. The intention is to support JIT for inference (to enable mobile use-cases), not for training. You could keep applying workarounds, but you might eventually hit landmines like setting fixed_size=(min_size,min_size), which will mess up your training in unintended ways. It's best to keep training in the Python world and inference in C++ to avoid surprises.

@beniz
Author

beniz commented Sep 6, 2021

@beniz Could you provide the exact error you get after applying the patch at #4369? My understanding is that the optional and the type definition should make it work.

Sure, I did patch a fresh clone with patch -p1 < 4369.patch then rebuilt torchvision. I am then forcing the script to load the patched version with sys.path.insert(0,'/path/to/lib/python/torchvision-0.11.0a0+80d5f50-py3.6-linux-x86_64.egg/'). I do double-check that the patched version is the one loaded up, with print(torchvision.__file__).
Let me know if this looks wrong.

Don't worry for us about the remaining issues like input sizes, we know what we're doing :)

FYI, we looked at it this morning: prior versions of torchvision used to export cleanly for training, so something broke along the way :( Once fixed (whether here or with custom patched code on our side), we'll add the export to our CI.

@datumbox
Contributor

datumbox commented Sep 6, 2021

Thanks for confirming your loading strategy. Could you provide the full error after running your code with the patch?

@beniz
Author

beniz commented Sep 6, 2021

Thanks for your attention to this issue @datumbox, the error remains identical:

Dynamic exception type: torch::jit::ErrorReport
std::exception::what: 
Unknown type name 'NoneType':
Serialized   File "code/__torch__/torchvision/models/detection/transform.py", line 11
  image_std : List[float]
  size_divisible : int
  fixed_size : NoneType
               ~~~~~~~~ <--- HERE
  def forward(self: __torch__.torchvision.models.detection.transform.GeneralizedRCNNTransform,
    images: List[Tensor],

Let me know if I can help by testing a few more code changes you would suggest, as I have the setup ready and I could report results on a range of changes.

@beniz beniz changed the title Missing fixed_size value in GeneralizedRCNNTransform breaks Faster-RCNN torchscript loading with C++ Missing fixed_size value in GeneralizedRCNNTransform breaks Faster-RCNN torchscript loading with C++ in train mode Sep 6, 2021
@datumbox
Contributor

datumbox commented Sep 6, 2021

@beniz I've temporarily modified a similar test that we have at vision here to export the model in train mode. I then passed data through it and I don't get any errors, see here.

Without being able to properly reproduce the error you see, it's hard to provide guidance. Would you be able to send a dummy PR modifying the above scripts so that they resemble your setup and reproduce the error on our CI (see the linked commit above for an example)? If you manage to reproduce it with a minimal example, I can help you investigate further.

@beniz
Author

beniz commented Sep 6, 2021

@datumbox Thanks very much. So further tests on our side did reveal that the C++ build was using torch 1.8, while with 1.9 there's no error.
My deepest apologies for the time this required on your side; maybe PR #4369 remains useful, if only on the principle of having properly typed signatures.
I'm closing this issue, thanks again for this and for the excellent work by the torchvision team!
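For future readers, the root cause (a libtorch 1.8 runtime loading an archive produced with torch 1.9) can be guarded against with a simple version check before loading. This is an illustrative sketch; the function name and compatibility policy are assumptions, not a PyTorch API:

```python
def versions_compatible(export_version: str, runtime_version: str) -> bool:
    """Illustrative guard: require the loading runtime to be at least the
    major.minor version used at export time (build suffixes like +cu111
    are ignored). The actual TorchScript compatibility rules are more
    nuanced; this only catches the obvious skew seen in this issue."""
    def major_minor(v: str):
        return tuple(int(x) for x in v.split("+")[0].split(".")[:2])
    return major_minor(runtime_version) >= major_minor(export_version)


print(versions_compatible("1.9.0+cu111", "1.8.1"))  # False: 1.8 runtime, 1.9 archive
```

Logging both versions at export and load time makes this class of failure much easier to diagnose than the NoneType error above.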

@beniz beniz closed this as completed Sep 6, 2021