Avoid thread termination in scoped_release #2657


Merged: 9 commits merged into pybind:master on Dec 19, 2020

Conversation

@malfet (Contributor) commented Nov 13, 2020

Description

Do not call PyEval_RestoreThread() from ~gil_scoped_release() if the Python runtime is finalizing, as doing so terminates the calling thread on Python runtimes newer than 3.6, as documented at https://docs.python.org/3/c-api/init.html#c.PyEval_RestoreThread
Similarly, do not call PyThreadState_DeleteCurrent() from ~gil_scoped_acquire() if the runtime is finalizing.

Discovered while debugging a PyTorch crash under Python 3.9, described in pytorch/pytorch#47776

Suggested changelog entry:

Avoid unwanted program termination if a `gil_scoped_release` is destroyed while the Python runtime is finalizing

@malfet (Contributor Author) commented Nov 13, 2020

@henryiii please let me know if this change makes sense to you (and suggest who could review it)

@henryiii (Collaborator) left a comment
Looks good to me. @rwgk & @YannickJadoul might want to check/verify.

@YannickJadoul (Collaborator) left a comment

If that's what the docs say, definitely!

Two small things:

  • What about 3.6 and before? It seems the behavior was simply not documented? Do we need to always call it, or could we use the post-3.6 code for 3.6 and earlier as well?
  • I guess there's no point in mixing compile-time and runtime checks to collapse this into if (PY_VERSION_HEX < 0x03070000 || !_Py_IsFinalizing())? It would only remove two one-line duplications, which is probably not worth the potential static-analysis warnings?

@YannickJadoul (Collaborator)

If that's what the docs say, definitely!

Thanks for catching/fixing this, btw, @malfet! :-)

@rwgk (Collaborator) left a comment

How much trouble is it to add a minimal unit test that is expected to work reliably only with this change? (I'm OK without adding a test if the setup is too much trouble.)

    // See https://docs.python.org/3/c-api/init.html#c.PyEval_RestoreThread
    if (!_Py_IsFinalizing())
        PyEval_RestoreThread(tstate);
#else
Collaborator:

Minor: typo: finilizing

More significantly: I think the code could be made more readable and maintainable by centralizing the #if #else #endif in an inline helper function, e.g. detail::finalization_guard(), so that there is a formal link to the comment from both places, and less preprocessor clutter.

Collaborator:

Are you sure that's not just going to add more confusion, having another indirection? As long as it's just these two (very related) spots, this seems reasonably fine to me? (We currently have these kinds of things inline in other places as well, and this should make it easy to remove once we drop <= 3.6?)

Collaborator:

The practical version of DRY/SPOT that I've seen states that exactly one duplication is allowed, if combining is clearly more complex. Taking DRY/SPOT to the extreme of truly one point of truth for all code can impose an unmanageable level of complexity, so allowing it to be relaxed by one at the programmer's discretion balances the ideal with practicality (and if there's more than one duplication, the extra complexity is always worth it).

In short, I'm fine either way.

Collaborator:

Thanks Henry! I hadn't heard of DRY/SPOT before, good to know there is something to refer to.

In this particular case I'm not so much concerned about the code duplication, but mostly about the comment only appearing in one place. For someone arriving at the second place from some completely different context, they will not even know the comment exists in the other place, and may miss it. But if they get curious about detail::finalization_guard(), they are guaranteed to find the comment.

Collaborator:

The comment is true, though! I completely agree there. There should at least be a "see a few lines up/down, at ...".

Apart from that, I don't feel too strongly, so up to @malfet AFAIC, but I just thought the #if SOME_PYTHON_VERSION pattern exists inline in lots of places in the code.
And it's kind of hard to capture what it's doing (i.e., finalization_guard doesn't really describe what's happening)?

@malfet (Contributor Author):

Thank you for the feedback. Fixed the typo, and moved and extended the runtime finalization check to pybind11/detail/internals.h.

@henryiii (Collaborator)

How much trouble is it to add a minimal unit test that is expected to work reliably only with this change?

If it only happens on interpreter finalization, it would have to be a test that spawns a Python instance. Probably not trivial, but not awful either (we have some already like that).

@henryiii henryiii added this to the v2.6.2 milestone Nov 15, 2020
@rwgk (Collaborator) left a comment

Looks good to me, thanks!
Super minor: in two places you have "runtime is finalizing". It feels like a "the" is missing: "the runtime is finalizing".

@malfet (Contributor Author) commented Nov 17, 2020

  • What about 3.6 and before? Seems the behavior was just not documented? Do we need to always call it, or could we use the post-3.6 code for 3.6 and earlier as well?

To tell the truth, I don't know, but there is a _Py_Finalizing global variable in 3.5 and 3.6, which could probably be used to detect whether the runtime is winding down.

@bstaletic (Collaborator)

Suppose we have two threads. One is holding the GIL and has started to finalize the interpreter. The other wants to use a CPython API and so first calls py::gil_scoped_acquire guard;. However, py::gil_scoped_acquire does exactly nothing because the interpreter is shutting down, so the thread "thinks" it acquired the GIL when it actually didn't. Now that thread is accessing the CPython API without holding the GIL. We all know that's no good.

Any thoughts regarding that?

And rename it to `is_finalizing`
@YannickJadoul (Collaborator) left a comment

Yup, this already makes me happier to see detail::is_finalizing and not in internals.h :-) Thanks!

I do still want to see @bstaletic's comment addressed, though. I don't know about all the GIL/threading internals of Python, but @bstaletic's point seems to be a valid concern, where we might be breaking other code?

@malfet (Contributor Author) commented Nov 17, 2020

Suppose we have two threads. One is holding the GIL and started to finalize the interpreter. The other wants to use a CPython API and so first calls py::gil_scoped_acquire guard;. However, py::gil_scoped_acquire does exactly nothing because the interpreter is shutting down, thus the thread "thinks" it acquired the GIL, when it actually didn't.

@bstaletic This scenario is unrelated to the change, as it does not alter the behavior of the py::gil_scoped_acquire constructor. But in a separate PR (or maybe even in this one) I would argue that pybind11_fail() should be raised from the constructor if the runtime is being finalized, to match the termination behavior documented at https://docs.python.org/3/c-api/init.html#c.PyEval_AcquireThread

@malfet malfet requested a review from YannickJadoul November 17, 2020 18:01
@bstaletic (Collaborator)

the change [...] does not change the behavior of py::gil_scoped_acquire constructor.

Right, my bad. What about a thread that's doing something like this?

#include <pybind11/pybind11.h>

auto the_other_thread() {
    {
        pybind11::gil_scoped_release g;
        // Busy-wait until the interpreter starts finalizing.
        while (!_Py_IsFinalizing());
    }  // ~gil_scoped_release() runs here
    pybind11::list l;
    for (auto i = 0; i < 1000000; ++i) l.append(pybind11::str("something long"));
}

PYBIND11_MODULE(foo, m) {
    m.def("the_other_thread", the_other_thread);
}

Obviously, this snippet shouldn't pass any kind of code review, but the point is that the thread might attempt to do anything with the CPython API. Before this PR, it would try to acquire the GIL and would get killed, which makes the problem manifest as close as possible to its cause. How do things change with this PR? If I understand correctly, since the destructor of gil_scoped_release doesn't acquire the GIL, "the other thread" ends up playing with the Python API without holding the GIL. Again, if I'm correct, that would put the cause (attempting to acquire the GIL at a horrible moment) "far away" from the effect (hopefully a SEGFAULT).

@malfet (Contributor Author) commented Nov 17, 2020

Obviously, this snippet shouldn't pass any kind of code review, but the point is that the thread might attempt to do anything with the CPython API. Before this PR, it would try to acquire the gil and would get killed. This makes the problem manifest as close as possible to the cause of the same problem. How do things change with this PR? If I understand things correctly, since the destructor of gil_scoped_release doesn't acquire the GIL, "the other thread" ends up playing with the python API, also, without holding the GIL.

Hmm. In that case, would throwing a runtime_error exception result in better behaviour?

@bstaletic (Collaborator)

We're talking about ~gil_scoped_release(). In C++, destructors are noexcept by default, so throwing from one would actually call std::terminate(). Before anyone proposes adding noexcept(false) to the destructor: the problem with that is that you can't have two exceptions in flight at the same time, or else std::terminate() again. So, if ~gil_scoped_release() was called because something already "exploded" and threw an exception, then throwing a runtime_error from ~gil_scoped_release() gets us back to square one: the thread gets terminated.

Now my first thought was "bad things are already happening, it's no worse than right now", but that might not be true. If we allow ~gil_scoped_release() to throw, then, as far as we're aware, there would be two "sources" of exceptions from inside this destructor.

  1. The proposed runtime error.
  2. Add noexcept(false) to destructors for gil_scoped_release #2215

To be clear, throwing destructors still make me very anxious. However, this might be a non-evil exception. If we are absolutely 100% positive that only code that was already horribly broken (such as my snippet above) would stay horribly broken, this throwing destructor might be okay. To whoever is still reading: when thinking about this, consider a really bad situation:

  1. One thread is finalizing the interpreter.
  2. The other thread is detached and actually outlives the interpreter shutting down, as in Add noexcept(false) to destructors for gil_scoped_release #2215
  3. That same other thread also throws a C++ exception, causing the stack to unwind instead of returning normally.
  4. The ~gil_scoped_release() also tries to throw an exception.

#2215 left one thing unanswered: in the bad scenario described there, we know the behaviour of POSIX systems, but we have no idea how Windows would behave. Which, to be honest, makes me even more wary of throwing an exception from ~gil_scoped_release().


Anyway, that was me trying to convince myself that a throwing destructor might be a viable option. A completely different solution, though not a great one, could be a sort of escape hatch: something like gil_scoped_acquire::release() and gil_scoped_release::release() that would "disarm" the destructor and make it do nothing. This would pass the ball back to the users, making them responsible for disarming these RAII guards when they are sure that disarming is safe.

And yes, I'm fully aware of the downsides that are involved in actually using such a mechanism. I'm just thinking out loud. None of the ideas I've thought of are perfect and I'm absolutely open for ideas. I'm also open to pybind11 adopting a least bad solution as long as we understand what the implications are.

@henryiii (Collaborator) commented Nov 24, 2020

How about gil_scoped_X::deactivate()? Then you'd call:

#include <pybind11/pybind11.h>

auto the_other_thread() {
    {
        pybind11::gil_scoped_release g;
        while (!_Py_IsFinalizing());
        if (_Py_IsFinalizing())
            g.deactivate();
    }
    // Now it's obvious that you may not have the GIL here, and you could
    // add an if (!_Py_IsFinalizing()) protection.
    pybind11::list l;
    for (auto i = 0; i < 1000000; ++i) l.append(pybind11::str("something long"));
}

Looking at it, I like disarm_destructor or deactivate_destructor better, since it's clear that it's simply changing the destructor process, rather than doing something right where it is called.

@henryiii (Collaborator)

And, of course, for PyTorch, you could then subclass these two guards and add a conditional call to this in the destructor.

@henryiii (Collaborator) commented Dec 16, 2020

I think we decided on disarm.

Should disarm be defined but a no-op on PyPy? And without threading support, should disarm still be defined but (like the whole class) be a no-op?

@YannickJadoul (Collaborator) left a comment

Cool, this is pretty slim/clean!

2 more things:

  • Do we need a test? As far as I'm concerned, we're following the Python docs, and a test would probably be more hazardous than the actual straightforward code changes we have here. Also, the GIL RAII classes are already tested.
  • Can we confirm this solution also works for @malfet? In principle, you could now still wrap this in a PyTorch-specific RAII wrapper, right? It's just that disarm at least makes the danger visible and doesn't do dangerous things by default; functionally, the same should be possible.

@henryiii (Collaborator)

No test, unless we want to just call it once to make sure it doesn't vanish into thin air someday. Trying to run a test on interpreter shutdown is likely very tricky.

@YannickJadoul (Collaborator)

No test, unless we want to just call it once to make sure it doesn't vanish into thin air someday. Trying to run a test on interpreter shutdown is likely very tricky.

Agreed!

@bstaletic (Collaborator)

Trying to run a test on interpreter shutdown is likely very tricky.

I tried and that's an understatement.

@henryiii henryiii closed this Dec 19, 2020
@henryiii henryiii reopened this Dec 19, 2020
@henryiii henryiii merged commit 79cb013 into pybind:master Dec 19, 2020
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Dec 19, 2020
@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Dec 22, 2020