Avoid thread termination in scoped_release #2657


Merged: 9 commits merged into pybind:master on Dec 19, 2020

Conversation

@malfet (Contributor) commented Nov 13, 2020

Description

Do not call PyEval_RestoreThread() from ~gil_scoped_release() if the Python runtime is finalizing, as doing so terminates the calling thread on Python runtimes newer than 3.6, as documented at https://docs.python.org/3/c-api/init.html#c.PyEval_RestoreThread
Similarly, do not call PyThreadState_DeleteCurrent() from ~gil_scoped_acquire() if the runtime is finalizing.

Discovered while debugging a PyTorch crash under Python 3.9, described in pytorch/pytorch#47776

Suggested changelog entry:

Avoid unwanted program termination if a `gil_scoped_release` is destroyed while the Python runtime is finalizing

@malfet (Contributor Author) commented Nov 13, 2020

@henryiii please let me know if this change makes sense to you (and suggest who could review it)

@henryiii (Collaborator) left a comment
Looks good to me. @rwgk & @YannickJadoul might want to check/verify.

@YannickJadoul (Collaborator) left a comment

If that's what the docs say, definitely!

Two small things:

  • What about 3.6 and before? It seems the behavior was simply not documented? Do we need to always call it, or could we use the post-3.6 code for 3.6 and earlier as well?
  • I guess there's no point in mixing compile-time and runtime checks to collapse this into if (PY_VERSION_HEX < 0x03070000 || !_Py_IsFinalizing())? It would only remove two one-line duplications, which is probably not worth the potential static-analysis warnings?

@YannickJadoul (Collaborator)

If that's what the docs say, definitely!

Thanks for catching/fixing this, btw, @malfet! :-)

@rwgk (Collaborator) left a comment

How much trouble is it to add a minimal unit test that is expected to work reliably only with this change? (I'm OK without adding a test if the setup is too much trouble.)

    // See https://docs.python.org/3/c-api/init.html#c.PyEval_RestoreThread
    if (!_Py_IsFinalizing())
        PyEval_RestoreThread(tstate);
#else
Collaborator:

Minor: typo: finilizing

More significantly: I think the code could be made more readable and maintainable by centralizing the #if #else #endif in an inline helper function, e.g. detail::finalization_guard(), so that there is a formal link to the comment from both places, and less preprocessor clutter.

Collaborator:

Are you sure that's not just going to add more confusion, having another indirection? As long as it's just these two (very related) spots, this seems reasonably fine to me? (We currently have these kinds of things inline in other places as well, and this should make it easy to remove once we drop <= 3.6?)

Collaborator:

The practical version of DRY/SPOT that I've seen states that exactly one duplication is allowed, if combining is clearly more complex. Taking DRY/SPOT to the extreme of truly one point of truth for all code can impose an unmanageable level of complexity, so allowing it to be relaxed by one at the programmer's discretion balances the ideal with practicality (and if there's more than one duplication, the extra complexity is always worth it).

In short, I'm fine either way.

Collaborator:

Thanks Henry! I hadn't heard of DRY/SPOT before, good to know there is something to refer to.

In this particular case I'm not so much concerned about the code duplication, but mostly about the comment only appearing in one place. For someone arriving at the second place from some completely different context, they will not even know the comment exists in the other place, and may miss it. But if they get curious about detail::finalization_guard(), they are guaranteed to find the comment.

Collaborator:

The comment is true, though! I completely agree there. There should at least be a "see a few lines up/down, at ...".

Apart from that, I don't feel too strongly, so up to @malfet AFAIC, but I just thought the #if SOME_PYTHON_VERSION pattern exists inline in lots of places in the code.
And it's kind of hard to capture what it's doing (i.e., finalization_guard doesn't really describe what's happening)?

@malfet (Contributor Author):

Thank you for the feedback. Fixed the typo, and moved and extended the runtime finalization check to pybind11/detail/internals.h.

@henryiii (Collaborator)

How much trouble is it to add a minimal unit test that is expected to work reliably only with this change?

If it only happens on interpreter finalization, it would have to be a test that spawns a Python instance. Probably not trivial, but not awful either (we have some already like that).

@henryiii henryiii added this to the v2.6.2 milestone Nov 15, 2020
@rwgk (Collaborator) left a comment

Looks good to me, thanks!
Super minor: in two places you have "runtime is finalizing". It feels like a "the" is missing: "the runtime is finalizing".

@malfet (Contributor Author) commented Nov 17, 2020

  • What about 3.6 and before? Seems the behavior was just not documented? Do we need to always call it, or could we use the post-3.6 code for 3.6 and earlier as well?

To tell the truth, I don't know, but there is a _Py_Finalizing global variable in 3.5 and 3.6, which could probably be used to detect whether the runtime is winding down.

@bstaletic (Collaborator)

Suppose we have two threads. One is holding the GIL and has started to finalize the interpreter. The other wants to use a CPython API and so first calls py::gil_scoped_acquire guard;. However, py::gil_scoped_acquire does exactly nothing because the interpreter is shutting down, so the thread "thinks" it acquired the GIL when it actually didn't. Now that thread is accessing the CPython API without holding the GIL. We all know that's no good.

Any thoughts regarding that?

And rename it to `is_finalizing`
@YannickJadoul (Collaborator) left a comment

Yup, this already makes me happier to see detail::is_finalizing and not in internals.h :-) Thanks!

I do still want to see @bstaletic's comment addressed, though. I don't know about all the GIL/threading internals of Python, but @bstaletic's point seems to be a valid concern, where we might be breaking other code?

@malfet (Contributor Author) commented Nov 17, 2020

Suppose we have two threads. One is holding the GIL and started to finalize the interpreter. The other wants to use a CPython API and so first calls py::gil_scoped_acquire guard;. However, py::gil_scoped_acquire does exactly nothing because the interpreter is shutting down, thus the thread "thinks" it acquired the GIL, when it actually didn't.

@bstaletic This scenario is unrelated to the change, as it does not alter the behavior of the py::gil_scoped_acquire constructor. But in a separate PR (or maybe even in this one) I would argue that pybind11_fail() should be raised from the constructor if the runtime is being finalized, to match the termination behavior documented at https://docs.python.org/3/c-api/init.html#c.PyEval_AcquireThread

@malfet malfet requested a review from YannickJadoul November 17, 2020 18:01
@bstaletic (Collaborator)

the change [...] does not change the behavior of py::gil_scoped_acquire constructor.

Right, my bad. What about a thread that's doing something like this?

#include <pybind11/pybind11.h>

auto the_other_thread() {
    {
        pybind11::gil_scoped_release g;
        // Busy-wait until the interpreter starts finalizing.
        while (!_Py_IsFinalizing());
    }  // ~gil_scoped_release() runs here
    pybind11::list l;
    for (auto i = 0; i < 1000000; ++i) l.append(pybind11::str("something long"));
}

PYBIND11_MODULE(foo, m) {
    m.def("the_other_thread", the_other_thread);
}

Obviously, this snippet shouldn't pass any kind of code review, but the point is that the thread might attempt to do anything with the CPython API. Before this PR, it would try to acquire the GIL and would get killed, which makes the problem manifest as close as possible to its cause. How do things change with this PR? If I understand correctly, since the destructor of gil_scoped_release doesn't acquire the GIL, "the other thread" ends up playing with the Python API without holding the GIL. Again, if I'm correct, that would put the cause (attempting to acquire the GIL at a horrible moment) "far away" from the effect (hopefully a SEGFAULT).

@malfet (Contributor Author) commented Nov 17, 2020

Obviously, this snippet shouldn't pass any kind of code review, but the point is that the thread might attempt to do anything with the CPython API. Before this PR, it would try to acquire the gil and would get killed. This makes the problem manifest as close as possible to the cause of the same problem. How do things change with this PR? If I understand things correctly, since the destructor of gil_scoped_release doesn't acquire the GIL, "the other thread" ends up playing with the python API, also, without holding the GIL.

Hmm. In that case, would throwing a runtime_error exception result in better behaviour?

@bstaletic (Collaborator)

We're talking about ~gil_scoped_release(). In C++, destructors are noexcept by default, so throwing from one would actually call std::terminate(). Before anyone proposes adding noexcept(false) to the destructor: the problem with that is that you can't have two exceptions in flight at the same time, or else std::terminate() again. So, if ~gil_scoped_release() was called because something already "exploded" and threw an exception, then throwing a runtime_error from ~gil_scoped_release() gets us back to square one: the thread gets terminated.

Now my first thought was "bad things are already happening, it's no worse than right now", but that might not be true. If we allow ~gil_scoped_release() to throw, then, as far as we're aware, there would be two "sources" of exceptions from inside this destructor.

  1. The proposed runtime error.
  2. Add noexcept(false) to destructors for gil_scoped_release #2215

To be clear, throwing destructors still make me very anxious. However, this might be a non-evil exception. If we are absolutely 100% positive that only code that was already horribly broken (such as my snippet above) would stay horribly broken, this throwing destructor might be okay. To whoever is still reading: when thinking about this, consider a really bad situation:

  1. One thread is finalizing the interpreter.
  2. The other thread is detached and actually outlives the interpreter shutting down, as in Add noexcept(false) to destructors for gil_scoped_release #2215
  3. That same other thread also throws a C++ exception, causing the stack to unwind instead of returning normally.
  4. The ~gil_scoped_release() also tries to throw an exception.

#2215 left one thing unanswered: in the bad scenario described there, we know the behaviour of POSIX systems, but we have no idea how Windows would behave. Which, to be honest, makes me even more wary of throwing an exception from ~gil_scoped_release().


Anyway, that was me trying to convince myself that a throwing destructor might be a viable option. A completely different solution, though not a great one, could be a sort of escape hatch: something like gil_scoped_acquire::release() and gil_scoped_release::release() that would "disarm" the destructor and make it do nothing. This would pass the ball back to the users, making them responsible for disarming these RAII guards when they are sure that disarming is safe.

And yes, I'm fully aware of the downsides that are involved in actually using such a mechanism. I'm just thinking out loud. None of the ideas I've thought of are perfect and I'm absolutely open for ideas. I'm also open to pybind11 adopting a least bad solution as long as we understand what the implications are.

@henryiii (Collaborator) commented Nov 24, 2020

How about gil_scoped_X::deactivate()? Then you'd call:

#include <pybind11/pybind11.h>

auto the_other_thread() {
    {
        pybind11::gil_scoped_release g;
        while (!_Py_IsFinalizing());
        if (_Py_IsFinalizing())
            g.deactivate();
    }
    // Now it's obvious that you may not have the GIL here, and you could
    // add an if (!_Py_IsFinalizing()) protection.
    pybind11::list l;
    for (auto i = 0; i < 1000000; ++i) l.append(pybind11::str("something long"));
}

Looking at it, I like disarm_destructor or deactivate_destructor better, since it's clear that it's simply changing the destructor process, rather than doing something right where it is called.

@henryiii (Collaborator)

And, of course, for PyTorch, you could then subclass these two guards and add a conditional call to this in the destructor.

@henryiii (Collaborator) commented Dec 16, 2020

I think we decided on disarm.

Should disarm be defined but a no-op on PyPy? And without threading support, should disarm still be defined but (like the whole class) be a no-op?

@YannickJadoul (Collaborator) left a comment

Cool, this is pretty slim/clean!

2 more things:

  • Do we need a test? As far as I'm concerned, we're following the Python docs, and a test would probably be more hazardous than the actual straightforward code changes we have here. Also, the GIL RAII classes are already tested.
  • Can we confirm this solution also works for @malfet? In principle, you could now still wrap this in a PyTorch-specific RAII wrapper, right? It's just that disarm at least makes the danger visible and doesn't do dangerous things by default; functionally, the same should be possible.

@henryiii (Collaborator)

No test, unless we want to just call it once to make sure it doesn't vanish into thin air someday. Trying to run a test on interpreter shutdown is likely very tricky.

@YannickJadoul (Collaborator)

No test, unless we want to just call it once to make sure it doesn't vanish into thin air someday. Trying to run a test on interpreter shutdown is likely very tricky.

Agreed!

@bstaletic (Collaborator)

Trying to run a test on interpreter shutdown is likely very tricky.

I tried and that's an understatement.

@henryiii henryiii closed this Dec 19, 2020
@henryiii henryiii reopened this Dec 19, 2020
@henryiii henryiii merged commit 79cb013 into pybind:master Dec 19, 2020
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Dec 19, 2020
@henryiii henryiii removed the needs changelog Possibly needs a changelog entry label Dec 22, 2020