Fix torch.accelerator API abort when passing invalid device #143550

guangyey · 2024-12-19T03:19:59Z

Stack from ghstack (oldest at bottom):

-> Fix torch.accelerator API abort when passing invalid device #143550

Motivation

Fix #143543

Solution

We should raise a Python exception instead of aborting...

Additional Context

Without this PR:

>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
  what():  device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

With this PR:

>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
    return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
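
Since the out-of-range index now surfaces as a `RuntimeError` rather than a process abort, calling code can recover from it. Below is a minimal pure-Python sketch of that behavior. It does not call PyTorch: `check_device_index` and `safe_lookup` are hypothetical names used only for illustration, and the only thing taken from this PR is the message text shown in the traceback above.

```python
def check_device_index(device_index: int, device_count: int) -> int:
    """Hypothetical stand-in for the bounds check; not the actual PyTorch code."""
    if not 0 <= device_index < device_count:
        # Message format copied from the traceback in this PR description.
        raise RuntimeError(
            f"The device index is out of range. It must be in [0, {device_count}), "
            f"but got {device_index}."
        )
    return device_index


def safe_lookup(device_index: int, device_count: int) -> str:
    """Hypothetical caller: an ordinary exception can be caught and handled."""
    try:
        check_device_index(device_index, device_count)
    except RuntimeError as err:
        return f"recovered: {err}"
    return f"device {device_index} is valid"


if __name__ == "__main__":
    print(safe_lookup(1, 2))
    print(safe_lookup(2, 2))
```

With the old behavior the equivalent of the second call would have terminated the interpreter, so no `except` clause could run.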

cc @albanD @EikanWang

pytorch-bot · 2024-12-19T03:20:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143550

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 2403bb2 with merge base 19d8bba:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 3, 4, linux.idc.xpu) (gh) (similar failure)
inductor/test_inplace_padding.py::InplacePaddingTest::test_linear_and_cel
xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu) (gh) (similar failure)
inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_conv3d_xpu

This comment was automatically generated by Dr. CI and updates every 15 minutes.

EikanWang · 2024-12-19T05:15:50Z

@guangyey , pls. help refine the error message a little bit as it is user-facing.

ghstack-source-id: b27b3e6 Pull Request resolved: #143550

guangyey · 2024-12-19T06:19:42Z

> @guangyey , pls. help refine the error message a little bit as it is user-facing.

Updated in 2403bb2

EikanWang

LGTM

[ghstack-poisoned]

dvrogozh

Works for me to address reported #143543.

albanD

oops !

guangyey · 2024-12-23T03:36:38Z

@pytorchbot merge

pytorchmergebot · 2024-12-23T03:38:37Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

guangyey requested review from EikanWang, eqy, gujinghui and syed-ahmed as code owners December 19, 2024 03:20

guangyey changed the title ~~Fix torch.accelerator API abort when passing invalid device~~ [WIP] Fix torch.accelerator API abort when passing invalid device Dec 19, 2024

guangyey added the topic: improvements topic category label Dec 19, 2024

pytorchbot added the open source label Dec 19, 2024

guangyey requested review from jeffdaily, jithunnair-amd, kulinseth and malfet as code owners December 19, 2024 03:46

pytorch-bot bot added the labels ciflow/mps ("Run MPS tests (subset of trunk)") and release notes: mps ("Release notes category") Dec 19, 2024

guangyey added the labels ciflow/xpu ("Run XPU CI tasks"), ciflow/trunk ("Trigger trunk jobs on your pull request"), ciflow/rocm ("Trigger "default" config CI on ROCm"), and release notes: xpu ("release notes category") Dec 19, 2024

guangyey mentioned this pull request Dec 19, 2024

xpu: torch.accelerate.current_stream() throws C++ instead of Python exception on invalid device #143543

Closed

EikanWang approved these changes Dec 19, 2024

guangyey added a commit that referenced this pull request Dec 19, 2024

Fix torch.accelerator API abort when passing invalid device

2d0d447

ghstack-source-id: b27b3e6 Pull Request resolved: #143550

guangyey requested a review from albanD December 19, 2024 06:20

EikanWang approved these changes Dec 19, 2024

Update

5dd62df

[ghstack-poisoned]

guangyey added 4 commits December 19, 2024 11:24

Update

8eb7df7

[ghstack-poisoned]

Update

75d5f32

[ghstack-poisoned]

Update

046ad8f

[ghstack-poisoned]

Update

2403bb2

[ghstack-poisoned]

dvrogozh approved these changes Dec 19, 2024

guangyey changed the title ~~[WIP] Fix torch.accelerator API abort when passing invalid device~~ Fix torch.accelerator API abort when passing invalid device Dec 20, 2024

guangyey added the module: accelerator Issues related to the shared accelerator API label Dec 20, 2024

albanD approved these changes Dec 20, 2024

pytorchmergebot added the merging label Dec 23, 2024

pytorchmergebot added the Merged label Dec 23, 2024

pytorchmergebot closed this in 07fa6e2 Dec 23, 2024

pytorchmergebot removed the merging label Dec 23, 2024

github-actions bot deleted the gh/guangyey/110/head branch January 23, 2025 02:04
