JIT related failures using Core nightly 20220426 #5881


Closed
datumbox opened this issue Apr 26, 2022 · 2 comments

@datumbox
Contributor

datumbox commented Apr 26, 2022

🐛 Describe the bug

The latest main branch fails the cmake_* jobs with the following error:

+ ./test_frcnn_tracing
Loading model
Model loaded
terminate called after throwing an instance of 'std::runtime_error'
  what():  The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torchvision/models/detection/faster_rcnn.py", line 93, in forward
    roi_heads = self.roi_heads
    image_sizes = images0.image_sizes
    _26 = (roi_heads).forward(features0, proposals, image_sizes, targets4, )
           ~~~~~~~~~~~~~~~~~~ <--- HERE
    detections, detector_losses, = _26
    transform0 = self.transform
  File "code/__torch__/torchvision/models/detection/roi_heads.py", line 341, in forward
      labels8 = torch.index(labels7, _104)
      nms_thresh = self.nms_thresh
      keep0 = _90(boxes4, scores3, labels8, nms_thresh, )
              ~~~ <--- HERE
      detections_per_img = self.detections_per_img
      keep1 = torch.slice(keep0, 0, None, detections_per_img)
  File "code/__torch__/torchvision/ops/boxes.py", line 49, in batched_nms
    _14 = False
  if _14:
    _17 = _9(boxes, scores, idxs, iou_threshold, )
          ~~ <--- HERE
    _16 = _17
  else:
  File "code/__torch__/torchvision/ops/boxes.py", line 74, in _batched_nms_vanilla
    iou_threshold: float) -> Tensor:
  _26 = __torch__.torch.functional._return_output
  keep_mask = torch.zeros_like(scores, dtype=11)
              ~~~~~~~~~~~~~~~~ <--- HERE
  _27 = _26(idxs, True, False, False, None, )
  for _28 in range(torch.len(_27)):

Traceback of TorchScript, original code (most recent call last):
  File "/root/project/torchvision/models/detection/generalized_rcnn.py", line 105, in forward
            features = OrderedDict([("0", features)])
        proposals, proposal_losses = self.rpn(images, features, targets)
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
                                      ~~~~~~~~~~~~~~ <--- HERE
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)  # type: ignore[operator]
    
  File "/root/project/torchvision/models/detection/roi_heads.py", line 716, in forward
    
            # non-maximum suppression, independently done per class
            keep = box_ops.batched_nms(boxes, scores, labels, self.nms_thresh)
                   ~~~~~~~~~~~~~~~~~~~ <--- HERE
            # keep only topk scoring predictions
            keep = keep[: self.detections_per_img]
  File "/root/project/torchvision/ops/boxes.py", line 72, in batched_nms
    # https://github.com/pytorch/vision/issues/1311#issuecomment-781329339
    if boxes.numel() > (4000 if boxes.device.type == "cpu" else 20000) and not torchvision._is_tracing():
        return _batched_nms_vanilla(boxes, scores, idxs, iou_threshold)
               ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    else:
        return _batched_nms_coordinate_trick(boxes, scores, idxs, iou_threshold)
  File "/root/project/torchvision/ops/boxes.py", line 105, in _batched_nms_vanilla
) -> Tensor:
    # Based on Detectron2 implementation, just manually call nms() on each class independently
    keep_mask = torch.zeros_like(scores, dtype=torch.bool)
                ~~~~~~~~~~~~~~~~ <--- HERE
    for class_id in torch.unique(idxs):
        curr_indices = torch.where(idxs == class_id)[0]
RuntimeError: isDifferentiableType(variable.scalar_type()) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1650956998902/work/torch/csrc/autograd/functions/utils.h":65, please report a bug to PyTorch. 

packaging/build_cmake.sh: line 106:  2593 Aborted                 (core dumped) ./test_frcnn_tracing

Exited with code exit status 134
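
For reference, here is a minimal Python sketch (an assumption on my part, not the actual C++ test_frcnn_tracing binary) that exercises the same code path the traceback points at: scripting torchvision.ops.batched_nms and making the input large enough that the CPU branch dispatches to _batched_nms_vanilla. The box count is only chosen to exceed the 4000-element cutoff shown in boxes.py.

import torch
from torchvision.ops import batched_nms

# Script the op so the call runs in the TorchScript interpreter,
# just like the serialized model in the test does.
scripted_nms = torch.jit.script(batched_nms)

n = 5000  # boxes.numel() == 4 * n > 4000, so the CPU path goes through _batched_nms_vanilla
xy = torch.rand(n, 2) * 100
boxes = torch.cat([xy, xy + torch.rand(n, 2) * 10], dim=1)  # valid x1y1x2y2 boxes
scores = torch.rand(n)
idxs = torch.randint(0, 10, (n,))

# This invocation reaches the torch.zeros_like(scores, dtype=torch.bool) call
# that the traceback above flags inside _batched_nms_vanilla.
keep = scripted_nms(boxes, scores, idxs, 0.5)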

More JIT-related failures can be seen on the unittest_linux_gpu_py3.8 job:

test_jit[4-dtype0-cuda-mean-2-1.0]
Traceback (most recent call last):
  File "/home/circleci/project/test/test_ops.py", line 1566, in test_jit
    scripted_focal_loss = script_fn(inputs, targets, gamma=gamma, alpha=alpha, reduction=reduction)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: required keyword attribute 'cache_id' is undefined
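
For context, a minimal sketch of what that parametrized test does, assuming it scripts torchvision.ops.sigmoid_focal_loss (the op whose gamma/alpha/reduction keywords appear in the traceback); the shapes and parameter values below are illustrative, not the exact ones used by test_ops.py.

import torch
from torchvision.ops import sigmoid_focal_loss

# Script the functional op and call it the way test_jit does.
script_fn = torch.jit.script(sigmoid_focal_loss)

# The failing parametrization ran on CUDA; shapes and dtypes here are illustrative.
inputs = torch.randn(16, 4, device="cuda")
targets = torch.randint(0, 2, (16, 4), device="cuda").float()

# In the failing job, the analogous call raised
# "required keyword attribute 'cache_id' is undefined" in the TorchScript interpreter.
loss = script_fn(inputs, targets, gamma=2.0, alpha=0.25, reduction="mean")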

It looks like this is due to an upstream change in PyTorch Core.

Versions

The breakage of the cmake_* jobs started appearing with the 20220426 nightly, see 66ed693.

Though other breakages existed with nightly 20220425, the cmake jobs were passing normally (see de31e4b). All of those breakages are due to upstream changes and are documented in #5873.

We have merged #5875 to solve the above issues related to nightly 20220425. For the new issues caused by 20220426, please check the CI results of commit cc53cd0, which give a clearer view of the problem.

cc @seemethere

datumbox changed the title from "cmake jobs fail with The following operation failed in the TorchScript interpreter" to "cmake jobs fail with JIT issues" on Apr 26, 2022
datumbox changed the title from "cmake jobs fail with JIT issues" to "JIT related failures using Core nightly 20220426" on Apr 26, 2022
@datumbox
Contributor Author

The pytorch/pytorch#75983 PR was reverted in pytorch/pytorch@bfb39e5, since it is likely to be causing the failure on unittest_linux_gpu_py3.8.

Also, @vfdev-5 confirmed that the pytorch/pytorch#75913 PR causes the issue on the cmake_* jobs.

@datumbox
Contributor Author

The problems are fixed in the latest nightly, 20220427.
