Do not check the GIL/decref object if the Python runtime is dead #4769
Conversation
include/pybind11/pytypes.h
Outdated
@@ -265,12 +265,15 @@ class handle : public detail::object_api<handle> {
         this function automatically. Returns a reference to itself.
         \endrst */
     const handle &dec_ref() const & {
+        // If python is already dead, leak the wrapped python objects
+        if (Py_IsInitialized()) {
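For context, a sketch of how the guarded function might read in full; this is an illustration assuming the body otherwise just decrefs the wrapped pointer, not the verbatim patch:

```cpp
// Illustrative sketch, not the verbatim patch: leak the wrapped object
// instead of touching a finalized runtime.
const handle &dec_ref() const & {
    // If Python is already dead, leak the wrapped Python object.
    if (Py_IsInitialized()) {
        Py_XDECREF(m_ptr); // only safe while the runtime is alive
    }
    return *this;
}
```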
With this, Py_IsInitialized() will run for every dec_ref(), even in production (NDEBUG) code. I don't think that's a viable solution. Moving this inside the #ifdef would be more acceptable, but ultimately the current code admits that dec_ref() is called from somewhere when the interpreter is dead already. That's the real problem.
Speculating: you have
- some C++ object that has py::object (or derived) members, or
- a static py::object (or derived).

The destructors run at teardown. You land here. Is that correct?
I think that code would need to be cleaned up.
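For illustration (a hypothetical module, not code from this PR), the teardown pattern being speculated about could look like:

```cpp
#include <pybind11/pybind11.h>
#include <utility>
namespace py = pybind11;

PYBIND11_MODULE(example, m) {
    // Function-local static: its destructor runs at process teardown,
    // typically after the interpreter has been finalized, so ~object()
    // ends up calling dec_ref() against a dead runtime.
    static py::object cached = py::none();
    m.def("stash", [](py::object o) { cached = std::move(o); });
}
```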
or a static py::object (or derived).
This one! The docs for creating custom exceptions suggest doing that (see the link in the PR description)!
I think that code would need to be cleaned up.
Does that mean the docs about exceptions need to change?
With this Py_IsInitialized() will run for every dec_ref(), even in production (NDEBUG) code. I don't think that's a viable solution.
This function is quite cheap. It looks up a global struct, so that should be very cheap.
As a data point, we're doing this in PyTorch today and never had any perf issue with it.
Does that mean the docs about exceptions need to change?
Oh ... yeah, that's a bad suggestion in the docs. I totally missed that before.
Almost exactly that situation. I discovered that also through the GIL guard (many months ago).
This function is quite cheap. It looks up a global struct, so that should be very cheap.
It doesn't feel right even from a general code health viewpoint, just the added complexity.
Also, I believe the transition from "is initialized" to "not anymore" isn't atomic. There could be traps lurking in the twilight zone. Better stay clean == fewer doubts.
Then it's only about addressing the teardown situation, which really shouldn't be that hard to clean up; the rest of the code can stay pure and simple.
In theory we could have a type derived from object that overrides dec_ref() in the way you want. But even that doesn't seem great. If it's ok to leak when the interpreter is dead already, then we might as well leak it unconditionally, especially when it's a static variable: that only triggers at process teardown.
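A minimal sketch of that idea (the name maybe_leaking_object is illustrative; this is not an existing pybind11 type):

```cpp
#include <pybind11/pybind11.h>
#include <utility>
namespace py = pybind11;

// Derived type that deliberately leaks its reference when the
// interpreter is already finalized.
class maybe_leaking_object : public py::object {
public:
    using py::object::object;
    maybe_leaking_object(py::object o) : py::object(std::move(o)) {}
    ~maybe_leaking_object() {
        if (!Py_IsInitialized()) {
            // Detach m_ptr so the base ~object() decrefs nothing;
            // the Python object is leaked on purpose.
            (void) release();
        }
    }
};
```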
Also, I believe the transition from "is initialized" to "not anymore" isn't atomic. There could be traps lurking in the twilight zone. Better stay clean == fewer doubts.
Yes, we had extended discussions about this on our end. You can see this comment for more details.
Then it's only to address the teardown situation, which really shouldn't be that hard to clean up
It is not that trivial on our end, as we have some Python objects held by C++ structures that can very often outlive the Python runtime (but are not necessarily global). For example, some C++ caches are still populated and hold onto Python objects until after the Python runtime dies. In that case, it is simpler to make our objects smoothly handle outliving the Python runtime rather than adding new indirections in these C++ objects to be able to check the Python runtime state.
I think, more generally, this happens because we're trying to discourage usage of PyObject* or any manual refcounting, to reduce bugs in our code. In particular, non-owning temporary global state is dangerous, as we can easily forget to do a manual decref when replacing the current field.
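A hypothetical snippet illustrating that hazard: with a non-owning global, every assignment must remember the manual refcount dance.

```cpp
#include <Python.h>

// Non-owning global: nothing enforces the decref of the old value.
static PyObject *g_current = nullptr;

void set_current(PyObject *obj) {
    Py_XINCREF(obj);        // take ownership of the new value
    Py_XDECREF(g_current);  // easy to forget; omitting it leaks the old value
    g_current = obj;
}
```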
Oh ... yeah, that's a bad suggestion in the docs. I totally missed that before.
Should I send a PR suggesting changing the exception suggestion to the following?
// This is a static object, so we must leak it.
static py::handle exc = py::exception<MyCustomException>(m, "MyCustomError").release();
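For completeness, a hedged sketch of how the released handle would then be used with a translator, mirroring the pattern in the pybind11 exceptions docs (assumes MyCustomException is a std::exception subclass and that this code runs inside the module initializer with m in scope):

```cpp
// Sketch following the docs' register_exception_translator pattern.
static py::handle exc = py::exception<MyCustomException>(m, "MyCustomError").release();
py::register_exception_translator([](std::exception_ptr p) {
    try {
        if (p) {
            std::rethrow_exception(p);
        }
    } catch (const MyCustomException &e) {
        // exc has static storage duration, so no capture is needed.
        PyErr_SetString(exc.ptr(), e.what());
    }
});
```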
It is not that trivial on our end as we have some Python object held by C++ structures that can very often outlive
Did you consider the idea under #4769 (comment)?
Then all you'd need to do is change object to (e.g.) maybe_leaking_object around your code.
To help you pinpoint what needs to be marked as maybe leaking, you could change the GIL guarding code to also report ((PyTypeObject *) Py_TYPE(m_ptr))->tp_name.
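A hypothetical sketch of that diagnostic (the helper name is invented for illustration):

```cpp
#include <Python.h>
#include <cstdio>

// Print the dynamic type of the object whose dec_ref() fired, to help
// locate which py::object is outliving the runtime.
void report_leaking_type(PyObject *m_ptr) {
    std::fprintf(stderr, "dec_ref() without live runtime on instance of '%s'\n",
                 ((PyTypeObject *) Py_TYPE(m_ptr))->tp_name);
}
```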
Should I send a PR suggesting changing the exception suggestion to the following?
Unless the maybe leaking object idea works out and works for you, then we could advertise it in the docs, too.
Did you consider the idea under
Yes, we already do that for our ~equivalent of py::object: https://github.com/pytorch/pytorch/blob/60e65a70e54fe47d07c972082cf1ccb8614c06fb/torch/csrc/utils/object_ptr.cpp (this is rather new).
Also, all our "historical" bindings are done via the C API directly and have this same fix (it has been added slowly over the past several years as we get in the habit of running with pydebug more and more).
Only a few objects on our end are bound with pybind and none of them had this issue until now.
I do agree that we could have a separate object subclass that has this behavior, and that should fix this problem.
Your suggestion is to add such a subclass to pybind11 once we need it?
Your suggestion is to add such a subclass to pybind11 once we need it?
Yes. Assuming it's only a small addition I think that'll be fine.
Might be best in a separate header that isn't included with pybind11.h
Am I missing something here: https://github.com/pytorch/pytorch/pull/106401/files
I don't seem to be able to get the release()-ed py::exception to be used at all. Am I missing some obvious C++ thing, or is releasing the exception not even a good workaround, so that we must introduce a maybe_leaking_object in pybind to support this properly?
We don't have the final conclusion yet, but we're definitely not going to do this (we are fixing the lifetime of the objects instead).
Fix issues with new asserts introduced in 3.12 and pybind gil holding check on destructor. See pybind/pybind11#4769 for details on why this is a preferred solution rather than skipping the decref in all pybind object destructors. Pull Request resolved: #106401 Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/Skylion007
Description
This change guards the GIL check and the decref of objects behind a check that the Python runtime still exists.
This situation can arise when a static variable outlives the Python runtime and is destroyed afterwards. It happens in particular when the user follows the recommended way to create custom exceptions from https://pybind11.readthedocs.io/en/stable/advanced/exceptions.html:
static py::exception<MyCustomException> exc(m, "MyCustomError");
In particular, this change fixes the segfault described at pytorch/pytorch#106083 (encountered while working on upgrading PyTorch to Python 3.12).
Suggested changelog entry:
Skip calling Python decref when the Python runtime is already dead.