NEW: Make event timing error messages more specific and actionable #559


Merged
merged 11 commits into from
Apr 27, 2025

Conversation

Contributor

@carterbox carterbox commented Apr 14, 2025

Description

closes #556

The CUDA driver returns different errors for the various failure modes when trying to compute elapsed time, and the documentation explains each of these scenarios. Surface each of them to Python users with actionable error messages.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

* Not sure what documentation should be added. Python functions sometimes document which errors they raise, but `__sub__` doesn't have a docs section.
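The approach described above can be sketched in plain Python. This is illustrative only, not the cuda.core implementation: the function name `elapsed_time_ms` and the hint strings are invented here, while the numeric status codes mirror the CUDA driver enum (`CUDA_ERROR_NOT_READY` = 600, `CUDA_ERROR_INVALID_HANDLE` = 400):

```python
# Hypothetical sketch: translate driver status codes from cuEventElapsedTime
# into actionable Python exceptions. Values mirror the CUDA driver enum.
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_HANDLE = 400
CUDA_ERROR_NOT_READY = 600

_TIMING_HINTS = {
    CUDA_ERROR_NOT_READY: (
        "One or both events have not completed. "
        "Sync the stream or events before computing elapsed time."
    ),
    CUDA_ERROR_INVALID_HANDLE: (
        "One or both events were created with timing disabled, or were "
        "never recorded. Create events with timing enabled and record "
        "them on a stream first."
    ),
}

def elapsed_time_ms(status, milliseconds):
    """Return elapsed ms, or raise an actionable error for known failures."""
    if status == CUDA_SUCCESS:
        return milliseconds
    hint = _TIMING_HINTS.get(status, "Unknown error computing elapsed time.")
    raise RuntimeError(f"cuEventElapsedTime failed ({status}): {hint}")
```

The point is that each driver status gets its own hint explaining what the user should change, rather than a bare error code.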

Contributor

copy-pr-bot bot commented Apr 14, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@carterbox
Contributor Author

/ok to test 8782dfa

@kkraus14
Collaborator

/ok to test 8782dfa


event1 = stream.record(options=enabled)
event2 = stream.record(options=enabled)
stream.sync()
event2 - event1
Collaborator

I'd delete this last block, because that's already covered more comprehensively under the existing test_timing (test_timing_success).


Collaborator

Did you maybe forget to push the commits?

Contributor Author

No. I'm still working on addressing feedback and would prefer to push the commits all at once instead of triggering the CI multiple times.

Member

Our CI is manual trigger only, so don't worry about excessive pushes 🙂

@rwgk
Collaborator

rwgk commented Apr 15, 2025

I think we are set up so that the CI will never trigger automatically; it only runs if you leave the /ok to test comment. Certainly in draft mode.

@carterbox
Contributor Author

/ok to test 99218bb

@carterbox carterbox requested review from rwgk and kkraus14 April 15, 2025 21:46
stream.wait(event2)
event3 = stream.record(options=enabled)

# event3 will never complete because the stream is waiting on event2 which is never recorded
Collaborator

Wouldn't there be no work recorded in event2 here, so stream.wait(event2) wouldn't actually wait on anything? Then, when event3 is recorded on the stream, there again wouldn't be any actual work recorded, so it could very well be finished?

Member

This is interesting, I would think that because event2 has no work recorded, stream.wait(event2) should raise (maybe cudaErrorInvalidResourceHandle?) instead of no-op.

Member

Took a quick look at the actual implementation, and Keith was right here. It's a no-op and cudaSuccess is returned if event2 is not recorded.
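For intuition, the behavior confirmed above can be modeled in a few lines of pure Python (no CUDA involved). The class and method names here are illustrative stand-ins, not the cuda.core API; the toy `wait` mirrors the driver's cuStreamWaitEvent semantics where waiting on a never-recorded event is a successful no-op:

```python
# Toy model: waiting on an unrecorded event succeeds without blocking,
# so a later event recorded on the same stream can still complete.
class ToyEvent:
    def __init__(self):
        self.recorded = False

class ToyStream:
    def __init__(self):
        self.blocked = False

    def wait(self, event):
        # Mirrors cuStreamWaitEvent: an event that was never recorded has
        # captured no work, so this succeeds as a no-op.
        if event.recorded:
            self.blocked = True
        return "cudaSuccess"

    def record(self, event):
        event.recorded = True
        # In this toy model, an event recorded on an unblocked stream
        # completes immediately.
        return not self.blocked

stream = ToyStream()
event2 = ToyEvent()                          # never recorded before the wait
assert stream.wait(event2) == "cudaSuccess"  # no-op, not an error
event3 = ToyEvent()
assert stream.record(event3)                 # event3 can still complete
```

This is why the original test sketch could not produce an incomplete event: nothing actually blocked the stream.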

Contributor Author

So what's the solution for testing this case? Launching a kernel that takes a long time to complete or never returns? I believe cuda.core now has all the necessary parts to compile and launch such a kernel from this test?


Member

@carterbox It is nerve-racking to leave the kernel spinning and not clean it up after the event test completes. I pushed commit add0ba1 to allow a signal to be passed from the host to the busy kernel.
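The busy-kernel-with-host-signal pattern described above can be illustrated with a CPU threading analogue. This is not the actual test code from add0ba1: the worker thread stands in for a kernel spinning on a flag in mapped host memory, and all names here are invented for illustration:

```python
import threading
import time

# CPU analogue of the device-side pattern: a "kernel" spins until the
# host flips a release flag, so the test can always clean it up.
signal = threading.Event()  # stands in for a flag in mapped host memory
done = threading.Event()

def busy_kernel():
    while not signal.is_set():  # device-side spin loop
        time.sleep(0.001)       # (a real kernel would just poll the flag)
    done.set()

t = threading.Thread(target=busy_kernel)
t.start()
# ... the event-timing assertions would run here while the "kernel" is busy ...
signal.set()      # host signals the kernel to exit
t.join(timeout=5)
assert done.is_set()
```

The design point is the cleanup guarantee: even if an assertion fails mid-test, setting the flag lets the spinning work finish instead of hanging the device.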

@leofang leofang added enhancement Any code-related improvements P1 Medium priority - Should do cuda.core Everything related to the cuda.core module labels Apr 18, 2025
@leofang leofang added this to the cuda.core beta 4 milestone Apr 18, 2025
@carterbox carterbox requested review from leofang and kkraus14 April 21, 2025 21:53
@leofang
Member

leofang commented Apr 27, 2025

/ok to test add0ba1

leofang
leofang previously approved these changes Apr 27, 2025
@leofang
Member

leofang commented Apr 27, 2025

/ok to test 638990d

@leofang
Member

leofang commented Apr 27, 2025

/ok to test 9904f3d

@leofang leofang enabled auto-merge April 27, 2025 04:25
@leofang leofang merged commit 2aca306 into NVIDIA:main Apr 27, 2025
75 checks passed
@leofang
Member

leofang commented Apr 27, 2025

Thanks for the nice improvements, Daniel!


# TODO: improve this once path finder can find headers
@pytest.mark.skipif(os.environ.get("CUDA_PATH") is None, reason="need libcu++ header")
@pytest.mark.skipif(tuple(int(i) for i in np.__version__.split(".")[:2]) < (2, 1), reason="need numpy 2.1.0+")
def test_error_timing_incomplete():
Collaborator

Wow, this is an amazingly sophisticated unit test setup.

Labels
cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P1 Medium priority - Should do
Development

Successfully merging this pull request may close these issues.

[BUG]: Event timing error misleading