Add noexcept(false) to destructors for gil_scoped_release #2215

Open
ezyang opened this issue May 15, 2020 · 12 comments

@ezyang

ezyang commented May 15, 2020

When a thread attempts to acquire the GIL but the Python interpreter has already destructed, Python will attempt to terminate the thread using pthread_exit. In many implementations of pthread_exit, this triggers a stack unwinding, which will immediately call std::terminate if you are inside a destructor that is noexcept(true). That is the case for the destructor of gil_scoped_release, which attempts to acquire the GIL on reentry. The net effect is that any code which uses gil_scoped_release and is called from a daemon thread in Python is likely to cause your process to exit unceremoniously.

The fix seems to be quite simple:

diff --git a/include/pybind11/pybind11.h b/include/pybind11/pybind11.h
index 7fa0f0e..eb501a7 100644
--- a/include/pybind11/pybind11.h
+++ b/include/pybind11/pybind11.h
@@ -1954,7 +1954,7 @@ public:
             PYBIND11_TLS_DELETE_VALUE(key);
         }
     }
-    ~gil_scoped_release() {
+    ~gil_scoped_release() noexcept(false) {
         if (!tstate)
             return;
         PyEval_RestoreThread(tstate);

Do you agree with this change? If so I can submit a PR for it. More context: pytorch/pytorch#38228

It would be reasonably simple to produce a repro that doesn't involve PyTorch; please let me know if that would be helpful.

@bstaletic
Collaborator

When a thread attempts to acquire the GIL but the Python interpreter has already destructed

This is always a bug, which makes me wonder if noexcept(false) is a good idea here.

@wjakob
Member

wjakob commented Jul 7, 2020

This is kind of an ugly situation, but sometimes unavoidable in larger projects. The noexcept modifier would be okay for me, but perhaps add an extra comment (/* enable stack unwinding when the interpreter is shutting down */).

@bstaletic
Collaborator

I'm just thinking that what is undoubtedly a bug should make a big and loud crash, so that detecting such a bug is easier.

@ezyang
Author

ezyang commented Jul 7, 2020

This is always a bug, which makes me wonder if noexcept(false) is a good idea here.

I think it is very, very difficult to avoid these situations in some cases. Remember that we're not purposely trying to acquire the GIL when Python is dead: it is the destructor for a gil_scoped_release in the stack of one of your worker threads. To shut down "properly" you would have to arrange for all of the worker threads to shut down before the Python interpreter shuts down. This is incredibly fiddly and you don't even necessarily have control of all the threads in your system.

I am generally in favor of clean destruction, especially when it makes running tools like valgrind and sanitizers easier, but in this particular case the payoff doesn't seem worth it.

@bstaletic
Collaborator

I'm still not convinced that noexcept(false) is a good idea.

  1. pthread_exit is a C API, so it shouldn't have anything to do with C++ exceptions. Please correct me if I'm wrong about this.
  2. Assuming that noexcept(false) does help, I really really don't like how it can silently hide bugs that are potentially very bad.
  3. Again, correct me if I'm wrong, but if pthread_exit is C, does it actually unwind the stack in the same sense as a C++ exception would?
  4. I still don't like it, but what about predicating PyEval_RestoreThread(tstate) on Py_IsInitialized()?

@ezyang
Author

ezyang commented Jul 8, 2020

Again, correct me if I'm wrong, but if pthread_exit is C, does it actually unwind the stack in the same sense as a C++ exception would?

In some implementations, yes. For example: https://stackoverflow.com/questions/11452546/why-does-pthread-exit-throw-something-caught-by-ellipsis

It is indeed pretty naughty to throw an exception from a C API, but this is the sort of thing a compiler is allowed to do, because it's the one defining the ABI in the first place.

I still don't like it, but what about predicating PyEval_RestoreThread(tstate) on Py_IsInitialized()?

I can test if this would solve our problem. I don't like it either, since if you don't throw an exception from the destructor you'll exit out of the destructor into a context where you ought to have had the GIL, but you don't actually have it, and so the failure happens at a later point in time. Forcing an unwinding is probably preferable since it makes the situation more obvious. But I am not too sure.

@YannickJadoul
Collaborator

I don't like it either, since if you don't throw an exception from the destructor you'll exit out of the destructor into a context where you ought to have had the GIL

But Python is dead, at that moment, right? No one is supposed to be touching the GIL or anything else in Python, at that point!

and so the failure happens at a later point in time. Forcing an unwinding is probably preferable since it makes the situation more obvious.

I think this is @bstaletic's argument on why to just have the whole program crash directly anyway, and not maybe catch some exception thrown by pthread_exit or similar. Crash immediately where the mistake happens, rather than hiding the error.

@ezyang
Author

ezyang commented Jul 8, 2020

But Python is dead, at that moment, right? No one is supposed to be touching the GIL or anything else in Python, at that point!

In our particular case, we know everything's going to be all right, because our top level loop looks like:

void PythonEngine::thread_init(int device, const std::shared_ptr<ReadyQueue>& ready_queue, bool should_increment) {
  // Increment thread usage count before acquiring the GIL
  if (should_increment) {
    increment_non_reentrant_thread_count();
  }
  // Create a PyThreadState, but release the GIL. This lets pybind11::gil_scoped_acquire calls
  // inside thread_main acquire the GIL without having to create a new
  // PyThreadState each time.
  pybind11::gil_scoped_acquire gil;
  pybind11::gil_scoped_release no_gil;
  Engine::thread_init(device, ready_queue, false);

  if (should_increment) {
    // Decrement the count during shutdown if we incremented earlier.
    decrement_non_reentrant_thread_count();
  }
}

But I think I would agree that in general it's dangerous.

@colesbury do you find yourself convinced by the arguments here too?

@colesbury
Contributor

Here is a self contained example, which I think will be a better reference than complicated PyTorch internals:

https://github.com/colesbury/pybind-exit-test

I'd like to emphasize that the standard Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS works fine, while gil_scoped_release crashes on exit.

The proposed change isn't perfect (I think nested sequences of gil_scoped_acquire/gil_scoped_release will probably still cause problems), but it's low risk and will fix the most common issue.


But Python is dead, at that moment, right? No one is supposed to be touching the GIL or anything else in Python, at that point!

Python is shutting down (usually). Re-acquiring the GIL from another thread is pretty common and behaves well when directly using the Python C API. Comments in the CPython code base suggest that this is explicitly supported.

Assuming that noexcept(false) does help, I really really don't like how it can silently hide bugs that are potentially very bad.

I don't think the proposed change is likely to hide any additional bugs. The called functions don't throw C++ exceptions on error.

I still don't like it, but what about predicating PyEval_RestoreThread(tstate) on Py_IsInitialized()?

That won't work. The thread will return into the Python interpreter without having acquired the GIL.

I'm just thinking that what is undoubtedly a bug should make a big and loud crash, so that detecting such a bug is easier.

Where do you think the bug lies? What change could the pybind-exit-test make to reliably avoid crashing?

@bstaletic
Collaborator

bstaletic commented Jul 9, 2020

I'd like to emphasize that the standard Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS works fine, while gil_scoped_release crashes on exit.

The Py_BEGIN_ALLOW_THREADS version segfaults if you try to throw a different exception from read_pipe_capi(). I know the example you gave doesn't use exceptions inside that function, but in the general case, a different user might.

Python is shutting down (usually). Re-acquiring the GIL from another thread is pretty common and behaves well when directly using the Python C API. Comments in the CPython code base suggest that this is explicitly supported.

Again, it segfaults if there's another exception already in flight.

Where do you think the bug lies? What change could the pybind-exit-test make to reliably avoid crashing?

Calling gc.collect() and t.join() just before you exit works.

EDIT: Just executing gc.collect() makes things work fine.

@colesbury
Contributor

The Py_BEGIN_ALLOW_THREADS version segfaults if you try to throw a different exception from read_pipe_capi(). I know the example you gave doesn't use exception inside that function, but in the general case, a different user might.

As you point out, the example doesn't use C++ exceptions so it doesn't handle them. You can change it to handle C++ exceptions if you like, but I don't think it's relevant to this issue.

Calling gc.collect() and t.join() just before you exit works.

Neither of those are reliable fixes. In real programs, the interpreter shutdown (Py_FinalizeEx) can take many milliseconds or longer: imported modules need to be finalized and the garbage collector is run on the full heap. In toy programs, the shutdown is usually quick. To reproduce this issue, it's important that the ~gil_scoped_release() call overlaps the Py_FinalizeEx() execution in the main thread. The example adds an artificial delay to Py_FinalizeEx() (via GC) to help reproduce this issue. Moving the artificial delay earlier (by gc.collect()) defeats its purpose. Real programs have no mechanism to ensure that Py_FinalizeEx() finishes quickly.

@wjakob
Member

wjakob commented Jul 9, 2020

There are still too many unknowns for me to seriously consider this PR.

What I've understood from this discussion: accessing the GIL when Python has shut down causes it to call pthread_exit() -- this seems very bad to me, but it's in CPython and not something that we can control here.

Then GCC/glibc raises __forced_unwind (a C-style exception?), which C++ seems to understand and be able to catch/propagate (this strikes me as something heavily compiler/platform-dependent). And you want it to be propagated rather than crashing in the destructor, where exceptions are forbidden.

But that is just one very specific platform: what about Clang with libc++/libc++abi (it doesn't even know about abi::__forced_unwind), what about macOS, what about MSVC? How does the unwind happen there?

To be honest, this seems to me like a case where it would be good to take a step back and see whether there isn't perhaps an alternative architecture that avoids these kinds of situations altogether (which reek of undefined behavior even with this hypothetical change).

If you really cannot avoid it at all, then I would suggest that you create your own flavor of gil_scoped_release in PyTorch that prods Python to see if it is still alive before trying to re-acquire the GIL (avoiding the nasty C-style exceptions being thrown from pthread_exit).
