NEW: Make event timing error messages more specific and actionable #559
Conversation
The CUDA driver provides different error messages for various errors when trying to compute elapsed time, and the documentation explains each of these scenarios. Surface each of these to Python users with actionable error messages.
/ok to test 8782dfa
cuda_core/tests/test_event.py
Outdated
event1 = stream.record(options=enabled)
event2 = stream.record(options=enabled)
stream.sync()
event2 - event1
I'd delete this last block, because that's already covered more comprehensively under the existing test_timing (test_timing_success).
Did you maybe forget to push the commits?
No. I'm still working on addressing feedback and would prefer to push the commits all at once instead of triggering the CI multiple times.
Our CI is manual trigger only, so don't worry about excessive pushes 🙂
I think we are set up so that the CI will never trigger automatically — only if you leave the ok to test comment. Certainly in draft mode.
/ok to test 99218bb
cuda_core/tests/test_event.py
Outdated
stream.wait(event2)
event3 = stream.record(options=enabled)
# event3 will never complete because the stream is waiting on event2 which is never recorded
Wouldn't there be no work recorded in event2 here, so the stream.wait(event2) wouldn't actually wait on anything? Then, with event3 recorded on the stream, there again wouldn't be any actual work recorded, so it very well could be finished?
This is interesting. I would think that because event2 has no work recorded, stream.wait(event2) should raise (maybe cudaErrorInvalidResourceHandle?) instead of being a no-op.
Took a quick look at the actual implementation, and Keith was right here. It's a no-op and cudaSuccess is returned if event2 is not recorded.
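The behavior confirmed above can be summarized with a toy model (pure Python, not the CUDA driver, and all names here are made up for illustration): a wait on an event that has never been recorded simply reports success without blocking.

```python
# Toy model of the confirmed driver behavior: waiting on an event with no
# recorded work is a no-op that returns success rather than raising.
class ToyEvent:
    def __init__(self):
        self.recorded = False  # nothing has been recorded into this event


class ToyStream:
    def wait(self, event):
        if not event.recorded:
            # No pending work to wait on: the call is a no-op "success".
            return "cudaSuccess"
        # A real driver would block here until the event's work completes.
        return "cudaSuccess"


event2 = ToyEvent()
stream = ToyStream()
print(stream.wait(event2))  # → cudaSuccess
```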
So what's the solution for testing this case? Launching a kernel that takes a long time to complete or never returns? I believe cuda.core now has all the necessary parts to compile and launch such a kernel from this test?
@carterbox It is nerve-wracking to leave the kernel spinning and not clean it up after the event test is completed. I pushed commit add0ba1 to allow a signal to be passed from the host to the busy kernel.
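The cleanup pattern described above — a host-side signal that releases a deliberately busy worker so the test can tear down deterministically — can be sketched in pure Python with threading. This is only an analogy for the shape of the protocol, not the CUDA kernel from the commit:

```python
import threading
import time

stop_flag = threading.Event()


def busy_worker():
    # Spin until the host signals, analogous to a CUDA kernel polling a
    # host-visible flag so the test never leaves work running forever.
    while not stop_flag.is_set():
        time.sleep(0.001)


t = threading.Thread(target=busy_worker)
t.start()
# ... timing assertions would run here while the worker is still busy ...
stop_flag.set()  # host-side signal: let the worker finish
t.join()
assert not t.is_alive()
```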
/ok to test add0ba1
/ok to test 638990d
/ok to test 9904f3d
Thanks for the nice improvements, Daniel!
# TODO: improve this once path finder can find headers
@pytest.mark.skipif(os.environ.get("CUDA_PATH") is None, reason="need libcu++ header")
@pytest.mark.skipif(tuple(int(i) for i in np.__version__.split(".")[:2]) < (2, 1), reason="need numpy 2.1.0+")
def test_error_timing_incomplete():
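The second skipif above gates the test on NumPy 2.1+ by comparing major/minor version tuples. As a quick sanity check, the same parsing expression can be exercised standalone (the helper name here is just for illustration):

```python
# Mirror of the version-tuple comparison used in the skipif marker above.
def at_least_2_1(version: str) -> bool:
    # Take the first two dotted components and compare as an int tuple,
    # e.g. "2.1.0" -> (2, 1), "1.26.4" -> (1, 26).
    return tuple(int(i) for i in version.split(".")[:2]) >= (2, 1)


print(at_least_2_1("2.1.0"))   # → True
print(at_least_2_1("1.26.4"))  # → False
```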
Wow, this is an amazingly sophisticated unit test setup.
Description
closes #556
The CUDA driver provides different error messages for various errors when trying to compute elapsed time, and the documentation explains each of these scenarios. Surface each of these to Python users with actionable error messages.
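As an illustration of the idea (not the PR's actual implementation), mapping driver error codes to actionable messages could look like the sketch below. The error-code names mirror the CUDA driver API; the helper name, message wording, and the `EventOptions(enable_timing=True)` spelling are assumptions for the example:

```python
# Hypothetical mapping from cuEventElapsedTime error codes to actionable
# messages; the message text is illustrative, not the PR's wording.
_ELAPSED_TIME_ERRORS = {
    "CUDA_ERROR_NOT_READY": (
        "One or both events have not completed; synchronize the stream or "
        "the events before computing elapsed time."
    ),
    "CUDA_ERROR_INVALID_HANDLE": (
        "Timing is disabled on one or both events; create them with timing "
        "enabled (e.g. EventOptions(enable_timing=True))."
    ),
}


def explain_elapsed_time_error(code: str) -> str:
    # Fall back to the raw code when no specific guidance is known.
    return _ELAPSED_TIME_ERRORS.get(code, f"cuEventElapsedTime failed with {code}")


print(explain_elapsed_time_error("CUDA_ERROR_NOT_READY"))
```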
Checklist
* Not sure what documentation should be added. Sometimes Python functions document what errors are raised, but `__sub__` doesn't have a docs section.