NEW: Make event timing error messages more specific and actionable #559


Merged
merged 11 commits into from
Apr 27, 2025

Conversation

Contributor

@carterbox carterbox commented Apr 14, 2025

Description

closes #556

The CUDA driver returns different errors for the various failure modes when trying to compute elapsed time, and the documentation explains each of these scenarios. Surface each of them to Python users with actionable error messages.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

* Not sure what documentation should be added. Python functions sometimes document which errors they raise, but `__sub__` doesn't have a docs section.
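The approach described above can be sketched in plain Python. This is illustrative only, not the cuda.core implementation: the function name `elapsed_time_ms` and the hint strings are invented here, while the numeric status codes mirror the CUDA driver enum (`CUDA_ERROR_NOT_READY` = 600, `CUDA_ERROR_INVALID_HANDLE` = 400):

```python
# Hypothetical sketch: translate driver status codes from cuEventElapsedTime
# into actionable Python exceptions. Values mirror the CUDA driver enum.
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_HANDLE = 400
CUDA_ERROR_NOT_READY = 600

_TIMING_HINTS = {
    CUDA_ERROR_NOT_READY: (
        "One or both events have not completed. "
        "Sync the stream or events before computing elapsed time."
    ),
    CUDA_ERROR_INVALID_HANDLE: (
        "One or both events were created with timing disabled, or were "
        "never recorded. Create events with timing enabled and record "
        "them on a stream first."
    ),
}

def elapsed_time_ms(status, milliseconds):
    """Return elapsed ms, or raise an actionable error for known failures."""
    if status == CUDA_SUCCESS:
        return milliseconds
    hint = _TIMING_HINTS.get(status, "Unknown error computing elapsed time.")
    raise RuntimeError(f"cuEventElapsedTime failed ({status}): {hint}")
```

The point is that each driver status gets its own hint explaining what the user should change, rather than a bare error code.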

Contributor

copy-pr-bot bot commented Apr 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@carterbox
Contributor Author

/ok to test 8782dfa

@kkraus14
Collaborator

/ok to test 8782dfa


event1 = stream.record(options=enabled)
event2 = stream.record(options=enabled)
stream.sync()
event2 - event1
Collaborator

I'd delete this last block, because that's already covered more comprehensively under the existing test_timing (test_timing_success).


Collaborator

Did you maybe forget to push the commits?

Contributor Author

No. I'm still working on addressing feedback and would prefer to push the commits all at once instead of triggering the CI multiple times.

Member

Our CI is manual trigger only, so don't worry about excessive pushes 🙂

@rwgk
Collaborator

rwgk commented Apr 15, 2025

I think we are set up so that the CI will never trigger automatically; it only runs if you leave the /ok to test comment. Certainly in draft mode.

@carterbox
Contributor Author

/ok to test 99218bb

@carterbox carterbox requested review from rwgk and kkraus14 April 15, 2025 21:46
stream.wait(event2)
event3 = stream.record(options=enabled)

# event3 will never complete because the stream is waiting on event2 which is never recorded
Collaborator

Wouldn't there be no work recorded in event2 here, so stream.wait(event2) wouldn't actually wait on anything? Then, when event3 is recorded on the stream, there again wouldn't be any actual work recorded, so it could very well be finished?

Member

This is interesting, I would think that because event2 has no work recorded, stream.wait(event2) should raise (maybe cudaErrorInvalidResourceHandle?) instead of no-op.

Member

Took a quick look at the actual implementation, and Keith was right here. It's a no-op and cudaSuccess is returned if event2 is not recorded.
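For intuition, the behavior confirmed above can be modeled in a few lines of pure Python (no CUDA involved). The class and method names here are illustrative stand-ins, not the cuda.core API; the toy `wait` mirrors the driver's cuStreamWaitEvent semantics where waiting on a never-recorded event is a successful no-op:

```python
# Toy model: waiting on an unrecorded event succeeds without blocking,
# so a later event recorded on the same stream can still complete.
class ToyEvent:
    def __init__(self):
        self.recorded = False

class ToyStream:
    def __init__(self):
        self.blocked = False

    def wait(self, event):
        # Mirrors cuStreamWaitEvent: an event that was never recorded has
        # captured no work, so this succeeds as a no-op.
        if event.recorded:
            self.blocked = True
        return "cudaSuccess"

    def record(self, event):
        event.recorded = True
        # In this toy model, an event recorded on an unblocked stream
        # completes immediately.
        return not self.blocked

stream = ToyStream()
event2 = ToyEvent()                          # never recorded before the wait
assert stream.wait(event2) == "cudaSuccess"  # no-op, not an error
event3 = ToyEvent()
assert stream.record(event3)                 # event3 can still complete
```

This is why the original test sketch could not produce an incomplete event: nothing actually blocked the stream.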

Contributor Author

So what's the solution for testing this case? Launching a kernel that takes a long time to complete or never returns? I believe cuda.core now has all the necessary parts to compile and launch such a kernel from this test?


Member

@carterbox It is nerve-racking to leave the kernel spinning and not clean it up after the event test completes. I pushed commit add0ba1 to allow a signal to be passed from the host to the busy kernel.
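The busy-kernel-with-host-signal pattern described above can be illustrated with a CPU threading analogue. This is not the actual test code from add0ba1: the worker thread stands in for a kernel spinning on a flag in mapped host memory, and all names here are invented for illustration:

```python
import threading
import time

# CPU analogue of the device-side pattern: a "kernel" spins until the
# host flips a release flag, so the test can always clean it up.
signal = threading.Event()  # stands in for a flag in mapped host memory
done = threading.Event()

def busy_kernel():
    while not signal.is_set():  # device-side spin loop
        time.sleep(0.001)       # (a real kernel would just poll the flag)
    done.set()

t = threading.Thread(target=busy_kernel)
t.start()
# ... the event-timing assertions would run here while the "kernel" is busy ...
signal.set()      # host signals the kernel to exit
t.join(timeout=5)
assert done.is_set()
```

The design point is the cleanup guarantee: even if an assertion fails mid-test, setting the flag lets the spinning work finish instead of hanging the device.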

@leofang leofang added enhancement Any code-related improvements P1 Medium priority - Should do cuda.core Everything related to the cuda.core module labels Apr 18, 2025
@leofang leofang added this to the cuda.core beta 4 milestone Apr 18, 2025
@carterbox carterbox requested review from leofang and kkraus14 April 21, 2025 21:53
@leofang
Member

leofang commented Apr 27, 2025

/ok to test add0ba1

leofang
leofang previously approved these changes Apr 27, 2025
@leofang
Member

leofang commented Apr 27, 2025

/ok to test 638990d

@leofang
Member

leofang commented Apr 27, 2025

/ok to test 9904f3d

@leofang leofang enabled auto-merge April 27, 2025 04:25
@leofang leofang merged commit 2aca306 into NVIDIA:main Apr 27, 2025
75 checks passed
@leofang
Member

leofang commented Apr 27, 2025

Thanks for the nice improvements, Daniel!


# TODO: improve this once path finder can find headers
@pytest.mark.skipif(os.environ.get("CUDA_PATH") is None, reason="need libcu++ header")
@pytest.mark.skipif(tuple(int(i) for i in np.__version__.split(".")[:2]) < (2, 1), reason="need numpy 2.1.0+")
def test_error_timing_incomplete():
Collaborator

Wow, this is an amazingly sophisticated unit test setup.

Labels
cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P1 Medium priority - Should do
Development

Successfully merging this pull request may close these issues.

[BUG]: Event timing error misleading