Do not check the GIL/decref object if the Python runtime is dead #4769

Closed

Conversation

@albanD (Contributor) commented Jul 31, 2023

Description

This change guards the GIL check and decref for objects behind a check that the Python runtime still exists.
The runtime can already be gone when a static variable outlives it and is destroyed later. This happens when the user follows the recommended way to create custom exceptions from https://pybind11.readthedocs.io/en/stable/advanced/exceptions.html, namely static py::exception<MyCustomException> exc(m, "MyCustomError");
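For reference, the docs-recommended pattern looks roughly like the sketch below (a paraphrase of the pybind11 exception docs; MyCustomException and the translator body are illustrative placeholders, not the actual PyTorch code):

#include <pybind11/pybind11.h>
#include <exception>
#include <stdexcept>
namespace py = pybind11;

// Placeholder C++ exception type for the sake of the example.
struct MyCustomException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

PYBIND11_MODULE(example, m) {
    // Static storage duration: this py::exception outlives the Python
    // runtime, so its destructor decrefs after the interpreter is finalized.
    static py::exception<MyCustomException> exc(m, "MyCustomError");
    py::register_exception_translator([](std::exception_ptr p) {
        try {
            if (p) {
                std::rethrow_exception(p);
            }
        } catch (const MyCustomException &e) {
            // Raise the registered Python exception type.
            PyErr_SetString(exc.ptr(), e.what());
        }
    });
}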

In particular, this change fixes the segfault described at pytorch/pytorch#106083 (encountered while upgrading PyTorch to Python 3.12).

Suggested changelog entry:

Skip calling Python decref when the Python runtime is already dead.

@@ -265,12 +265,15 @@ class handle : public detail::object_api<handle> {
         this function automatically. Returns a reference to itself.
     \endrst */
     const handle &dec_ref() const & {
+        // If python is already dead, leak the wrapped python objects
+        if (Py_IsInitialized()) {
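For context, the guarded dec_ref() that this hunk implies might read roughly as follows (a sketch only; the pre-existing body is assumed to be a plain Py_XDECREF, which the hunk above does not show):

// Sketch only: assumes the original body simply called Py_XDECREF(m_ptr).
const handle &dec_ref() const & {
    // If python is already dead, leak the wrapped python objects
    if (Py_IsInitialized()) {
        Py_XDECREF(m_ptr);
    }
    return *this;
}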
@rwgk (Collaborator) commented

With this, Py_IsInitialized() will run for every dec_ref(), even in production (NDEBUG) code. I don't think that's a viable solution.

Moving this inside the #ifdef would be more acceptable, but ultimately the current code admits that dec_ref() is called from somewhere while the interpreter is already dead. That's the real problem.

Speculating:

You have

  • some C++ object that has py::object (or derived) members,
  • or a static py::object (or derived).

The dtors run at teardown.

You land here. Is that correct?

I think that code would need to be cleaned up.
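A minimal repro of the static-object scenario speculated above might look like this (an illustrative embedded-interpreter sketch; the module and attribute used are arbitrary):

#include <pybind11/embed.h>
namespace py = pybind11;

// A static py::object whose destructor runs after the interpreter is gone.
static py::object cached;  // destroyed after main() returns

int main() {
    py::scoped_interpreter guard;                     // interpreter lives only inside main()
    cached = py::module_::import("math").attr("pi");  // `cached` now owns a reference
    return 0;  // `guard` finalizes Python here; ~object for `cached` runs
               // afterwards and calls dec_ref() on a dead runtime
}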

@albanD (Contributor, author) commented Jul 31, 2023

> or a static py::object (or derived).

This one! The docs for creating custom exceptions suggest doing exactly that (see the link in the PR description)!

> I think that code would need to be cleaned up.

Does that mean the docs about exceptions need to change?

> With this, Py_IsInitialized() will run for every dec_ref(), even in production (NDEBUG) code. I don't think that's a viable solution.

This function is quite cheap: it only looks up a global struct.
As a data point, we're already doing this in PyTorch today and have never had any perf issue with it.

@rwgk (Collaborator) commented

> Does that mean the docs about exceptions need to change?

Oh ... yeah, that's a bad suggestion in the docs. I totally missed that before.

Déjà vu: https://github.com/pybind/pybind11_abseil/blob/abc6790db41d3bb8bbd4923b966119a9a4d5296c/pybind11_abseil/register_status_bindings.cc#L144-L146

Almost exactly that situation. I also discovered it through the GIL guard (many months ago).

> This function is quite cheap: it only looks up a global struct.

It doesn't feel right even from a general code health viewpoint, just because of the added complexity.

Also, I believe the transition from "is initialized" to "not anymore" isn't atomic. There could be traps lurking in the twilight zone. Better to stay clean == fewer doubts.

This change only exists to address the teardown situation, which really shouldn't be that hard to clean up; then the rest of the code can stay pure and simple.

@rwgk (Collaborator) commented Jul 31, 2023

In theory we could have a type derived from object that overrides dec_ref() in the way you want. But even that doesn't seem great. If it's OK to leak when the interpreter is already dead, then we might as well leak unconditionally, especially for a static variable: that only triggers at process teardown.
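A hypothetical sketch of that idea (the class name and mechanism are illustrative, not part of pybind11; since dec_ref() is not virtual, this variation leaks in the destructor rather than overriding dec_ref()):

#include <pybind11/pybind11.h>
#include <utility>
namespace py = pybind11;

// Hypothetical "maybe leaking" wrapper: behaves like py::object, but if the
// Python runtime is already finalized when the destructor runs, it drops
// ownership instead of decref'ing on a dead interpreter.
class maybe_leaking_object : public py::object {
public:
    maybe_leaking_object() = default;
    maybe_leaking_object(py::object o) : py::object(std::move(o)) {}

    ~maybe_leaking_object() {
        if (!Py_IsInitialized()) {
            // Interpreter is gone: release ownership so that ~object()
            // does not call Py_XDECREF on a dead runtime.
            (void) release();
        }
        // Otherwise ~object() runs normally and decrefs as usual.
    }
};

A member or static declared as maybe_leaking_object instead of py::object would then survive process teardown without touching the dead interpreter.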

@albanD (Contributor, author) commented

> Also, I believe the transition from "is initialized" to "not anymore" isn't atomic. There could be traps lurking in the twilight zone. Better to stay clean == fewer doubts.

Yes, we had extended discussions about this on our end. You can see this comment for context:

https://github.com/pytorch/pytorch/blob/1a6f1d816d1615fb2a9d4bd8627e597b4df4920c/torch/csrc/utils/pybind.h#L380-L387

> This change only exists to address the teardown situation, which really shouldn't be that hard to clean up

It is not that trivial on our end, as we have some Python objects held by C++ structures that can very often outlive the Python runtime (but are not necessarily global). For example, some C++ caches are still populated and hold onto Python objects until after the Python runtime dies. In that case, it is simpler to make our objects smoothly handle outliving the Python runtime than to add new indirections in these C++ objects just to check the Python runtime state.

More generally, I think this happens because we're trying to discourage use of PyObject* or any manual refcounting to reduce bugs in our code. In particular, non-owning temporary global state is dangerous because we can easily forget to do a manual decref when replacing the current field.

> Oh ... yeah, that's a bad suggestion in the docs. I totally missed that before.

Should I send a PR changing the exception docs to suggest the following?

// This is a static object, so we must leak it.
static py::handle exc = py::exception<MyCustomException>(m, "MyCustomError").release();
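For illustration, the full pattern with the released handle might look roughly like this (a hedged sketch; MyCustomException is the same placeholder type as in the earlier sketch, and the translator body is illustrative):

#include <pybind11/pybind11.h>
#include <exception>
#include <stdexcept>
namespace py = pybind11;

struct MyCustomException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

PYBIND11_MODULE(example, m) {
    // This is a static object, so we must leak it: the handle never decrefs.
    static py::handle exc = py::exception<MyCustomException>(m, "MyCustomError").release();
    py::register_exception_translator([](std::exception_ptr p) {
        try {
            if (p) {
                std::rethrow_exception(p);
            }
        } catch (const MyCustomException &e) {
            PyErr_SetString(exc.ptr(), e.what());
        }
    });
}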

@rwgk (Collaborator) commented Aug 1, 2023

> It is not that trivial on our end, as we have some Python objects held by C++ structures that can very often outlive

Did you consider the idea under #4769 (comment)?

Then all you'd need to do is change object to (e.g.) maybe_leaking_object around your code.

To help you pinpoint what needs to be marked as maybe leaking, you could change the GIL guarding code to also report ((PyTypeObject *) Py_TYPE(m_ptr))->tp_name.

> Should I send a PR changing the exception docs to suggest the following?

Unless the maybe-leaking-object idea works out and works for you, in which case we could advertise it in the docs, too.
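A hypothetical diagnostic along those lines might look like this (the helper name and placement are illustrative, not existing pybind11 code):

#include <cstdio>
#include <pybind11/pybind11.h>

// Report the type name of the object about to be dec_ref'd, to help pinpoint
// which owners need the "maybe leaking" treatment.
inline void report_dec_ref_type(PyObject *ptr) {
    if (ptr != nullptr) {
        std::fprintf(stderr, "dec_ref() on object of type %s\n", Py_TYPE(ptr)->tp_name);
    }
}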

@albanD (Contributor, author) commented

> Did you consider the idea under

Yes, we already do that for our ~equivalent of py::object: https://github.com/pytorch/pytorch/blob/60e65a70e54fe47d07c972082cf1ccb8614c06fb/torch/csrc/utils/object_ptr.cpp (this is rather new).
Also, all our "historical" bindings are done via the C API directly and have this same fix (it has been added gradually over the past several years as we got into the habit of running with pydebug more and more).

Only a few objects on our end are bound with pybind11, and none of them had this issue until now.
I do agree that we could have a separate object subclass with this behavior, and that should fix this problem.
Is your suggestion to add such a subclass to pybind11 once we need it?

@rwgk (Collaborator) commented

> Is your suggestion to add such a subclass to pybind11 once we need it?

Yes. Assuming it's only a small addition, I think that'll be fine.
It might be best in a separate header that isn't included by pybind11.h.

@albanD (Contributor, author) commented

Am I missing something here: https://github.com/pytorch/pytorch/pull/106401/files?
I can't seem to get the release()-ed py::exception to be used at all. Am I missing some obvious C++ thing, or is releasing the exception not even a good workaround, meaning we must introduce a maybe_leaked_object in pybind11 to support this properly?

@albanD (Contributor, author) commented Aug 4, 2023

We don't have a final conclusion yet, but we're definitely not going to do this (we're fixing the lifetime of the objects instead).
See pytorch/pytorch#106401 for more details on the discussion.

@albanD closed this Aug 4, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 7, 2023
Fix issues with new asserts introduced in 3.12 and pybind gil holding check on destructor.
See pybind/pybind11#4769 for details on why this is a preferred solution rather than skipping the decref in all pybind object destructors.
Pull Request resolved: #106401
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/Skylion007
Cyril-Anto pushed a commit to Cyril-Anto/pytorch that referenced this pull request Aug 17, 2023
Fix issues with new asserts introduced in 3.12 and pybind gil holding check on destructor.
See pybind/pybind11#4769 for details on why this is a preferred solution rather than skipping the decref in all pybind object destructors.
Pull Request resolved: pytorch#106401
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/Skylion007