@Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Oct 9, 2025

Errors occurring during Buffer.close are not raised. This change adds tests demonstrating the issue. See #1118.

@Andy-Jost Andy-Jost self-assigned this Oct 9, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Oct 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost
Contributor Author

/ok to test 845fbd4

@github-actions

github-actions bot commented Oct 9, 2025

@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 845fbd4 to adfb7e5 Compare October 9, 2025 22:08
@Andy-Jost
Contributor Author

/ok to test 0986f5e

@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 0986f5e to 78a4815 Compare October 9, 2025 22:17
@Andy-Jost
Contributor Author

/ok to test 1d5248e

@Andy-Jost Andy-Jost added test Improvements or additions to tests cuda.core Everything related to the cuda.core module labels Oct 9, 2025
@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 1d5248e to bbdbbcd Compare October 9, 2025 22:33
@Andy-Jost Andy-Jost changed the title Add skipped tests demonstrating that errors in Buffer.close are not raised Add (failing) tests demonstrating that errors in Buffer.close are not raised Oct 9, 2025
@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch 2 times, most recently from 6e9c283 to fea6f8a Compare October 9, 2025 22:36
mr.close()


@pytest.mark.xfail
Collaborator

Will this work here?

@pytest.mark.xfail(reason="Issue #1118", strict=True)

The important part is strict=True.
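
A minimal sketch of what the strict marker buys, assuming a hypothetical `SilentResource` whose `close()` swallows errors the way issue #1118 describes (the class and test name are illustrative, not taken from the PR):

```python
import pytest


class SilentResource:
    """Hypothetical resource whose close() swallows the underlying error."""

    def close(self):
        try:
            raise RuntimeError("release failed")
        except RuntimeError:
            pass  # the error is suppressed, which is the behavior under test


# With strict=True, pytest reports an unexpected pass (XPASS) as a failure,
# so the marker cannot silently outlive the fix for the referenced issue.
@pytest.mark.xfail(reason="Issue #1118", strict=True)
def test_close_raises():
    with pytest.raises(RuntimeError):
        SilentResource().close()
```

Without `strict=True`, an unexpected pass is only reported as XPASS and the run still succeeds, so a stale marker can go unnoticed.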

Contributor Author

Thanks, that's good to know.

Member

@leofang leofang left a comment

HUGE 👎 in testing this.

Comment on lines +153 to +152
def test_error_in_close_memory_resource(ipc_memory_resource):
    """Test that errors when closing a memory resource are raised."""
    mr = ipc_memory_resource
    driver.cuMemPoolDestroy(mr.handle)
    with pytest.raises(CUDAError, match=".*CUDA_ERROR_INVALID_VALUE.*"):
        mr.close()
Member

@leofang leofang Oct 10, 2025

This is illegal and I disagree we need to test this. This is Python and we can't possibly guard against all kinds of bizarre ways of trying to mutate the state of our Python objects behind our back. In particular, as noted in both #1074 (comment) and offline discussion, errors like CUDA_ERROR_INVALID_VALUE are due to multiple frees. I thought we've moved on?

This test is just another instance of the same class of errors: We free the handle of an object through a direct C API call, bypassing our safeguard mechanism (under the hood we do check if the handle is already null before freeing, and then after free we set the handle to null to avoid double free), so our destructor kicks in, either through an explicit close() call or implicitly when going out of scope, and causes another free.
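
The safeguard described above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeDriver` and `Resource` are hypothetical stand-ins, not the cuda.core classes or the CUDA driver API.

```python
class FakeDriver:
    """Hypothetical stand-in for a C API that rejects freeing a dead handle."""

    def __init__(self):
        self._live = set()
        self._next = 1

    def alloc(self):
        handle = self._next
        self._next += 1
        self._live.add(handle)
        return handle

    def free(self, handle):
        if handle not in self._live:
            raise RuntimeError("INVALID_VALUE: handle is not live")
        self._live.remove(handle)


class Resource:
    """Hypothetical wrapper that owns a handle and guards against double free."""

    def __init__(self, driver):
        self._driver = driver
        self.handle = driver.alloc()

    def close(self):
        if self.handle is None:  # already closed: nothing to free
            return
        try:
            self._driver.free(self.handle)
        finally:
            self.handle = None  # never attempt to free this handle again


driver = FakeDriver()

guarded = Resource(driver)
guarded.close()
guarded.close()  # second call is a no-op because the handle is None

bypassed = Resource(driver)
driver.free(bypassed.handle)  # free behind the wrapper's back
try:
    bypassed.close()  # the wrapper still believes it owns the handle
    second_free_error = None
except RuntimeError as exc:
    second_free_error = str(exc)
```

The guard only protects frees that go through `close()`. A direct call on the raw handle leaves the wrapper's bookkeeping stale, which is the double-free case discussed in this thread.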

Contributor

@cpcloud cpcloud Oct 10, 2025

I agree that this test is asserting that different levels of APIs offer the same guarantees, which would make developing layers of APIs really really difficult.

It seems roughly analogous to calling into the Python C API through ctypes, and expecting Python to somehow know you didn't mean to cause a segmentation violation:

❯ python3.13 -q
>>> x = 1
>>> import ctypes
>>> ctypes.pythonapi.Py_DecRef(x)
zsh: segmentation fault (core dumped)  python3.13 -q

Contributor

@cpcloud cpcloud Oct 10, 2025

I would guess there's probably another way to write this test such that the behavior is buggy without crossing into the C abyss of naked bindings.

Would it be enough to just call close() twice? That seems like something we should perhaps be robust to if we're not already:

❯ python -q
>>> f = open('/tmp/x', 'w')
>>> f.close()
>>> f.close()

Contributor Author

@Andy-Jost Andy-Jost Oct 13, 2025

There seems to be a misunderstanding. The point of this test is to check that errors occurring in close are raised rather than suppressed. The exact mechanism I chose to get the driver to generate an error is beside the point. If anyone sees a simpler mechanism, please point it out.

I ran into this issue (close not raising errors) when working with the driver bug 5570902. I don't want to rely on that behavior for the test because if the driver team ever fixes the bug, the test would then break.

Let's not confuse the issue: we are not testing our robustness in the face of nonsense behavior, i.e., someone stomping around in the driver API directly. This test just does something otherwise dumb and unsupported to trigger an error, which is totally reasonable for this test.

@cpcloud our code already tolerates double-calls to close and simply ignores the second call without error.

Member

@leofang leofang Oct 16, 2025

There is no misunderstanding. Undefined behavior should not be tested. It is just confusing and sends a wrong signal.

Re: allowing an explicit .close() call to raise while the destructor may not, I did raise a discussion with @pciolkosz at the Tuesday meeting. It would indeed be nice to have, but it is unclear semantically what we can do about the exception, and technically it does not seem possible to implement in both C++ and Python (buffer.destroy(stream) from cccl-runtime is also noexcept). I suggest we table this discussion for later, and close this PR.

Comment on lines +167 to +163
    driver.cuMemFree(buffer.handle)
    with pytest.raises(CUDAError, match=".*CUDA_ERROR_INVALID_VALUE.*"):
        buffer.close()
Member

ditto

Comment on lines +181 to +180
try:
    driver.cuMemPoolDestroy(self.mr.handle)
except Exception:  # noqa: S110
    pass
else:
    self.mr.close()
Member

ditto

@leofang
Member

leofang commented Oct 10, 2025

FWIW cccl-runtime, the C++ standard library, and any other high-level framework or library have the same issue. Giving you access to the underlying handle of a container does not mean that you can free it through free() or delete behind the container's back. This is UB, and by testing it we are guaranteeing certain behavior (whatever it is).

@Andy-Jost
Contributor Author

FWIW cccl-runtime, the C++ standard library, and any other high-level framework or library have the same issue. Giving you access to the underlying handle of a container does not mean that you can free it through free() or delete behind the container's back. This is UB, and by testing it we are guaranteeing certain behavior (whatever it is).

Respectfully, I think this misses the point. When implicitly closing resources via __del__ or __dealloc__, there may be good reasons to suppress errors. However, when calling a function such as close directly, errors should not be suppressed.
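
A minimal sketch of the distinction argued here, assuming a hypothetical `Handle` class with an injected `release` callable. It illustrates the proposed semantics only and is not how cuda.core is implemented.

```python
class Handle:
    """Hypothetical resource: explicit close() raises, finalization does not."""

    def __init__(self, release):
        self._release = release  # hypothetical callable that frees the resource
        self._closed = False

    def close(self):
        if self._closed:  # repeated calls are ignored
            return
        self._closed = True
        self._release()  # any error propagates to the caller

    def __del__(self):
        try:
            self.close()
        except Exception:
            pass  # no caller exists to handle an error during finalization


def failing_release():
    raise RuntimeError("release failed")


explicit = Handle(failing_release)
try:
    explicit.close()
    explicit_error = None
except RuntimeError as exc:
    explicit_error = exc  # the explicit call surfaces the failure

implicit = Handle(failing_release)
del implicit  # in CPython this runs __del__ right away; the error is swallowed
```

Whether an equivalent split is achievable in the real library is the open question in this thread, since the destruction paths mentioned above are noexcept.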

@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 06c8d2d to c92fb6c Compare October 13, 2025 17:11
@Andy-Jost Andy-Jost closed this Oct 21, 2025
github-actions bot pushed a commit that referenced this pull request Nov 10, 2025
Removed preview folders for the following PRs:
- PR #1021
- PR #1034
- PR #1052
- PR #1059
- PR #1069
- PR #1086
- PR #1090
- PR #1096
- PR #1102
- PR #1103
- PR #1106
- PR #1107
- PR #1117
- PR #1133
- PR #1140
- PR #1166
- PR #1174
- PR #1185
- PR #1188
- PR #1191
... and 41 more