Do not check the GIL/decref object if the Python runtime is dead #4769

Closed

Conversation

@albanD (Contributor) commented Jul 31, 2023

Description

This change guards the GIL check and decref for objects behind a check that the Python runtime still exists.
The runtime can already be gone when a static variable outlives it and is destroyed later. This happens when the user follows the recommended way to create custom exceptions from https://pybind11.readthedocs.io/en/stable/advanced/exceptions.html, namely static py::exception<MyCustomException> exc(m, "MyCustomError");
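For reference, the docs-recommended pattern looks roughly like the sketch below (a paraphrase of the pybind11 exception docs; MyCustomException and the translator body are illustrative placeholders, not the actual PyTorch code):

#include <pybind11/pybind11.h>
#include <exception>
#include <stdexcept>
namespace py = pybind11;

// Placeholder C++ exception type for the sake of the example.
struct MyCustomException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

PYBIND11_MODULE(example, m) {
    // Static storage duration: this py::exception outlives the Python
    // runtime, so its destructor decrefs after the interpreter is finalized.
    static py::exception<MyCustomException> exc(m, "MyCustomError");
    py::register_exception_translator([](std::exception_ptr p) {
        try {
            if (p) {
                std::rethrow_exception(p);
            }
        } catch (const MyCustomException &e) {
            // Raise the registered Python exception type.
            PyErr_SetString(exc.ptr(), e.what());
        }
    });
}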

In particular, this change fixes the segfault described at pytorch/pytorch#106083 (encountered while upgrading PyTorch to Python 3.12).

Suggested changelog entry:

Skip calling Python decref when the Python runtime is already dead.

@@ -265,12 +265,15 @@ class handle : public detail::object_api<handle> {
         this function automatically. Returns a reference to itself.
     \endrst */
     const handle &dec_ref() const & {
+        // If python is already dead, leak the wrapped python objects
+        if (Py_IsInitialized()) {
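For context, the guarded dec_ref() that this hunk implies might read roughly as follows (a sketch only; the pre-existing body is assumed to be a plain Py_XDECREF, which the hunk above does not show):

// Sketch only: assumes the original body simply called Py_XDECREF(m_ptr).
const handle &dec_ref() const & {
    // If python is already dead, leak the wrapped python objects
    if (Py_IsInitialized()) {
        Py_XDECREF(m_ptr);
    }
    return *this;
}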
@rwgk (Collaborator) commented

With this, Py_IsInitialized() will run for every dec_ref(), even in production (NDEBUG) code. I don't think that's a viable solution.

Moving this inside the #ifdef would be more acceptable, but ultimately the current code admits that dec_ref() is called from somewhere while the interpreter is already dead. That's the real problem.

Speculating:

You have

  • some C++ object that has py::object (or derived) members,
  • or a static py::object (or derived).

The dtors run at teardown.

You land here. Is that correct?

I think that code would need to be cleaned up.
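A minimal repro of the static-object scenario speculated above might look like this (an illustrative embedded-interpreter sketch; the module and attribute used are arbitrary):

#include <pybind11/embed.h>
namespace py = pybind11;

// A static py::object whose destructor runs after the interpreter is gone.
static py::object cached;  // destroyed after main() returns

int main() {
    py::scoped_interpreter guard;                     // interpreter lives only inside main()
    cached = py::module_::import("math").attr("pi");  // `cached` now owns a reference
    return 0;  // `guard` finalizes Python here; ~object for `cached` runs
               // afterwards and calls dec_ref() on a dead runtime
}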

@albanD (Contributor, author) commented Jul 31, 2023

> or a static py::object (or derived).

This one! The docs for creating custom exceptions suggest doing exactly that (see the link in the PR description)!

> I think that code would need to be cleaned up.

Does that mean the docs about exceptions need to change?

> With this, Py_IsInitialized() will run for every dec_ref(), even in production (NDEBUG) code. I don't think that's a viable solution.

This function is quite cheap: it only looks up a global struct.
As a data point, we're already doing this in PyTorch today and have never had any perf issue with it.

@rwgk (Collaborator) commented

> Does that mean the docs about exceptions need to change?

Oh ... yeah, that's a bad suggestion in the docs. I totally missed that before.

Déjà vu: https://github.com/pybind/pybind11_abseil/blob/abc6790db41d3bb8bbd4923b966119a9a4d5296c/pybind11_abseil/register_status_bindings.cc#L144-L146

Almost exactly that situation. I also discovered it through the GIL guard (many months ago).

> This function is quite cheap: it only looks up a global struct.

It doesn't feel right even from a general code health viewpoint, just because of the added complexity.

Also, I believe the transition from "is initialized" to "not anymore" isn't atomic. There could be traps lurking in the twilight zone. Better to stay clean == fewer doubts.

This change only exists to address the teardown situation, which really shouldn't be that hard to clean up; then the rest of the code can stay pure and simple.

@rwgk (Collaborator) commented Jul 31, 2023

In theory we could have a type derived from object that overrides dec_ref() in the way you want. But even that doesn't seem great. If it's OK to leak when the interpreter is already dead, then we might as well leak unconditionally, especially for a static variable: that only triggers at process teardown.
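A hypothetical sketch of that idea (the class name and mechanism are illustrative, not part of pybind11; since dec_ref() is not virtual, this variation leaks in the destructor rather than overriding dec_ref()):

#include <pybind11/pybind11.h>
#include <utility>
namespace py = pybind11;

// Hypothetical "maybe leaking" wrapper: behaves like py::object, but if the
// Python runtime is already finalized when the destructor runs, it drops
// ownership instead of decref'ing on a dead interpreter.
class maybe_leaking_object : public py::object {
public:
    maybe_leaking_object() = default;
    maybe_leaking_object(py::object o) : py::object(std::move(o)) {}

    ~maybe_leaking_object() {
        if (!Py_IsInitialized()) {
            // Interpreter is gone: release ownership so that ~object()
            // does not call Py_XDECREF on a dead runtime.
            (void) release();
        }
        // Otherwise ~object() runs normally and decrefs as usual.
    }
};

A member or static declared as maybe_leaking_object instead of py::object would then survive process teardown without touching the dead interpreter.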

@albanD (Contributor, author) commented

> Also, I believe the transition from "is initialized" to "not anymore" isn't atomic. There could be traps lurking in the twilight zone. Better to stay clean == fewer doubts.

Yes, we had extended discussions about this on our end. You can see this comment for context:

https://github.com/pytorch/pytorch/blob/1a6f1d816d1615fb2a9d4bd8627e597b4df4920c/torch/csrc/utils/pybind.h#L380-L387

> This change only exists to address the teardown situation, which really shouldn't be that hard to clean up

It is not that trivial on our end, as we have some Python objects held by C++ structures that can very often outlive the Python runtime (but are not necessarily global). For example, some C++ caches are still populated and hold onto Python objects until after the Python runtime dies. In that case, it is simpler to make our objects smoothly handle outliving the Python runtime than to add new indirections in these C++ objects just to check the Python runtime state.

More generally, I think this happens because we're trying to discourage use of PyObject* or any manual refcounting to reduce bugs in our code. In particular, non-owning temporary global state is dangerous because we can easily forget to do a manual decref when replacing the current field.

> Oh ... yeah, that's a bad suggestion in the docs. I totally missed that before.

Should I send a PR changing the exception docs to suggest the following?

// This is a static object, so we must leak it.
static py::handle exc = py::exception<MyCustomException>(m, "MyCustomError").release();
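For illustration, the full pattern with the released handle might look roughly like this (a hedged sketch; MyCustomException is the same placeholder type as in the earlier sketch, and the translator body is illustrative):

#include <pybind11/pybind11.h>
#include <exception>
#include <stdexcept>
namespace py = pybind11;

struct MyCustomException : std::runtime_error {
    using std::runtime_error::runtime_error;
};

PYBIND11_MODULE(example, m) {
    // This is a static object, so we must leak it: the handle never decrefs.
    static py::handle exc = py::exception<MyCustomException>(m, "MyCustomError").release();
    py::register_exception_translator([](std::exception_ptr p) {
        try {
            if (p) {
                std::rethrow_exception(p);
            }
        } catch (const MyCustomException &e) {
            PyErr_SetString(exc.ptr(), e.what());
        }
    });
}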

@rwgk (Collaborator) commented Aug 1, 2023

> It is not that trivial on our end, as we have some Python objects held by C++ structures that can very often outlive

Did you consider the idea under #4769 (comment)?

Then all you'd need to do is change object to (e.g.) maybe_leaking_object around your code.

To help you pinpoint what needs to be marked as maybe leaking, you could change the GIL guarding code to also report ((PyTypeObject *) Py_TYPE(m_ptr))->tp_name.

> Should I send a PR changing the exception docs to suggest the following?

Unless the maybe-leaking-object idea works out and works for you, in which case we could advertise it in the docs, too.
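A hypothetical diagnostic along those lines might look like this (the helper name and placement are illustrative, not existing pybind11 code):

#include <cstdio>
#include <pybind11/pybind11.h>

// Report the type name of the object about to be dec_ref'd, to help pinpoint
// which owners need the "maybe leaking" treatment.
inline void report_dec_ref_type(PyObject *ptr) {
    if (ptr != nullptr) {
        std::fprintf(stderr, "dec_ref() on object of type %s\n", Py_TYPE(ptr)->tp_name);
    }
}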

@albanD (Contributor, author) commented

> Did you consider the idea under

Yes, we already do that for our ~equivalent of py::object: https://github.com/pytorch/pytorch/blob/60e65a70e54fe47d07c972082cf1ccb8614c06fb/torch/csrc/utils/object_ptr.cpp (this is rather new).
Also, all our "historical" bindings are done via the C API directly and have this same fix (it has been added gradually over the past several years as we got into the habit of running with pydebug more and more).

Only a few objects on our end are bound with pybind11, and none of them had this issue until now.
I do agree that we could have a separate object subclass with this behavior, and that should fix this problem.
Is your suggestion to add such a subclass to pybind11 once we need it?

@rwgk (Collaborator) commented

> Is your suggestion to add such a subclass to pybind11 once we need it?

Yes. Assuming it's only a small addition, I think that'll be fine.
It might be best in a separate header that isn't included by pybind11.h.

@albanD (Contributor, author) commented

Am I missing something here: https://github.com/pytorch/pytorch/pull/106401/files?
I can't seem to get the release()-ed py::exception to be used at all. Am I missing some obvious C++ thing, or is releasing the exception not even a good workaround, meaning we must introduce a maybe_leaked_object in pybind11 to support this properly?

@albanD (Contributor, author) commented Aug 4, 2023

We don't have a final conclusion yet, but we're definitely not going to do this (we're fixing the lifetime of the objects instead).
See pytorch/pytorch#106401 for more details on the discussion.

@albanD closed this Aug 4, 2023
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 7, 2023
Fix issues with new asserts introduced in 3.12 and pybind gil holding check on destructor.
See pybind/pybind11#4769 for details on why this is a preferred solution rather than skipping the decref in all pybind object destructors.
Pull Request resolved: #106401
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/Skylion007
Cyril-Anto pushed a commit to Cyril-Anto/pytorch that referenced this pull request Aug 17, 2023
Fix issues with new asserts introduced in 3.12 and pybind gil holding check on destructor.
See pybind/pybind11#4769 for details on why this is a preferred solution rather than skipping the decref in all pybind object destructors.
Pull Request resolved: pytorch#106401
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/Skylion007