Add noexcept(false) to destructors for gil_scoped_release #2215

Open
ezyang opened this issue May 15, 2020 · 12 comments

@ezyang

ezyang commented May 15, 2020

When a thread attempts to acquire the GIL but the Python interpreter has already destructed, Python will attempt to terminate the thread using pthread_exit. In many implementations of pthread_exit, this triggers a stack unwinding, which will immediately call std::terminate if you are inside a destructor that is noexcept(true). That is the case for the destructor of gil_scoped_release, which attempts to acquire the GIL on reentry. The net effect is that any code which uses gil_scoped_release and is called from a daemon thread in Python is likely to cause your process to exit unceremoniously.

The fix seems to be quite simple:

diff --git a/include/pybind11/pybind11.h b/include/pybind11/pybind11.h
index 7fa0f0e..eb501a7 100644
--- a/include/pybind11/pybind11.h
+++ b/include/pybind11/pybind11.h
@@ -1954,7 +1954,7 @@ public:
             PYBIND11_TLS_DELETE_VALUE(key);
         }
     }
-    ~gil_scoped_release() {
+    ~gil_scoped_release() noexcept(false) {
         if (!tstate)
             return;
         PyEval_RestoreThread(tstate);

Do you agree with this change? If so I can submit a PR for it. More context: pytorch/pytorch#38228

It would be reasonably simple to produce a repro that doesn't involve PyTorch; please let me know if that would be helpful.

@bstaletic
Collaborator

When a thread attempts to acquire the GIL but the Python interpreter has already destructed

This is always a bug, which makes me wonder if noexcept(false) is a good idea here.

@wjakob
Member

wjakob commented Jul 7, 2020

This is kind of an ugly situation, but sometimes unavoidable in larger projects. The noexcept modifier would be okay for me, but perhaps add an extra comment (/* enable stack unwinding when the interpreter is shutting down */).

@bstaletic
Collaborator

I'm just thinking that what is undoubtedly a bug should make a big and loud crash, so that detecting such a bug is easier.

@ezyang
Author

ezyang commented Jul 7, 2020

This is always a bug, which makes me wonder if noexcept(false) is a good idea here.

I think it is very, very difficult to avoid these situations in some cases. Remember that we're not purposely trying to acquire the GIL when Python is dead: it is the destructor for a gil_scoped_release in the stack of one of your worker threads. To shut down "properly" you would have to arrange for all of the worker threads to shut down before the Python interpreter shuts down. This is incredibly fiddly and you don't even necessarily have control of all the threads in your system.

I am generally in favor of clean destruction, especially when it makes running tools like valgrind and sanitizers easier, but in this particular case the payoff doesn't seem worth it.

@bstaletic
Collaborator

I'm still not convinced that noexcept(false) is a good idea.

  1. pthread_exit is a C API, so it shouldn't have anything to do with C++ exceptions. Please correct me if I'm wrong about this.
  2. Assuming that noexcept(false) does help, I really really don't like how it can silently hide bugs that are potentially very bad.
  3. Again, correct me if I'm wrong, but if pthread_exit is C, does it actually unwind the stack in the same sense as a C++ exception would?
  4. I still don't like it, but what about predicating PyEval_RestoreThread(tstate) on Py_IsInitialized()?

@ezyang
Author

ezyang commented Jul 8, 2020

Again, correct me if I'm wrong, but if pthread_exit is C, does it actually unwind the stack in the same sense as a C++ exception would?

In some implementations, yes. For example: https://stackoverflow.com/questions/11452546/why-does-pthread-exit-throw-something-caught-by-ellipsis

It is indeed pretty naughty to throw an exception from a C API, but this is the sort of thing a compiler is allowed to do, because it's the one defining the ABI in the first place.

I still don't like it, but what about predicating PyEval_RestoreThread(tstate) on Py_IsInitialized()?

I can test if this would solve our problem. I don't like it either, since if you don't throw an exception from the destructor you'll exit out of the destructor into a context where you ought to have had the GIL, but you don't actually have it, and so the failure happens at a later point in time. Forcing an unwinding is probably preferable since it makes the situation more obvious. But I am not too sure.

@YannickJadoul
Collaborator

I don't like it either, since if you don't throw an exception from the destructor you'll exit out of the destructor into a context where you ought to have had the GIL

But Python is dead, at that moment, right? No one is supposed to be touching the GIL or anything else in Python, at that point!

and so the failure happens at a later point in time. Forcing an unwinding is probably preferable since it makes the situation more obvious.

I think this is @bstaletic's argument on why to just have the whole program crash directly anyway, and not maybe catch some exception thrown by pthread_exit or similar. Crash immediately where the mistake happens, rather than hiding the error.

@ezyang
Author

ezyang commented Jul 8, 2020

But Python is dead, at that moment, right? No one is supposed to be touching the GIL or anything else in Python, at that point!

In our particular case, we know everything's going to be all right, because our top level loop looks like:

void PythonEngine::thread_init(int device, const std::shared_ptr<ReadyQueue>& ready_queue, bool should_increment) {
  // Increment thread usage count before acquiring the GIL
  if (should_increment) {
    increment_non_reentrant_thread_count();
  }
  // Create a PyThreadState, but release the GIL. This lets pybind11::gil_scoped_acquire calls
  // inside thread_main acquire the GIL without having to create a new
  // PyThreadState each time.
  pybind11::gil_scoped_acquire gil;
  pybind11::gil_scoped_release no_gil;
  Engine::thread_init(device, ready_queue, false);

  if (should_increment) {
    // Decrement the count during shutdown if we incremented earlier.
    decrement_non_reentrant_thread_count();
  }
}

But I think I would agree that in general it's dangerous.

@colesbury do you find yourself convinced by the arguments here too?

@colesbury
Contributor

Here is a self contained example, which I think will be a better reference than complicated PyTorch internals:

https://github.com/colesbury/pybind-exit-test

I'd like to emphasize that the standard Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS works fine, while gil_scoped_release crashes on exit.

The proposed change isn't perfect (I think nested sequences of gil_scoped_acquire/gil_scoped_release will probably still cause problems), but it's low risk and will fix the most common issue.


But Python is dead, at that moment, right? No one is supposed to be touching the GIL or anything else in Python, at that point!

Python is shutting down (usually). Re-acquiring the GIL from another thread is pretty common and behaves well when directly using the Python C API. Comments in the CPython code base suggest that this is explicitly supported.

Assuming that noexcept(false) does help, I really really don't like how it can silently hide bugs that are potentially very bad.

I don't think the proposed change is likely to hide any additional bugs. The called functions don't throw C++ exceptions on error.

I still don't like it, but what about predicating PyEval_RestoreThread(tstate) on Py_IsInitialized()?

That won't work. The thread will return into the Python interpreter without having acquired the GIL.

I'm just thinking that what is undoubtedly a bug should make a big and loud crash, so that detecting such a bug is easier.

Where do you think the bug lies? What change could the pybind-exit-test make to reliably avoid crashing?

@bstaletic
Collaborator

bstaletic commented Jul 9, 2020

I'd like to emphasize that the standard Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS works fine, while gil_scoped_release crashes on exit.

The Py_BEGIN_ALLOW_THREADS version segfaults if you try to throw a different exception from read_pipe_capi(). I know the example you gave doesn't use exceptions inside that function, but in the general case, a different user might.

Python is shutting down (usually). Re-acquiring the GIL from another thread is pretty common and behaves well when directly using the Python C API. Comments in the CPython code base suggest that this is explicitly supported.

Again, it segfaults if there's another exception already in flight.

Where do you think the bug lies? What change could the pybind-exit-test make to reliably avoid crashing?

Calling gc.collect() and t.join() just before you exit works.

EDIT: Just executing gc.collect() makes things work fine.

@colesbury
Contributor

The Py_BEGIN_ALLOW_THREADS version segfaults if you try to throw a different exception from read_pipe_capi(). I know the example you gave doesn't use exception inside that function, but in the general case, a different user might.

As you point out, the example doesn't use C++ exceptions so it doesn't handle them. You can change it to handle C++ exceptions if you like, but I don't think it's relevant to this issue.

Calling gc.collect() and t.join() just before you exit works.

Neither of those are reliable fixes. In real programs, the interpreter shutdown (Py_FinalizeEx) can take many milliseconds or longer: imported modules need to be finalized and the garbage collector is run on the full heap. In toy programs, the shutdown is usually quick. To reproduce this issue, it's important that the ~gil_scoped_release() call overlaps the Py_FinalizeEx() execution in the main thread. The example adds an artificial delay to Py_FinalizeEx() (via GC) to help reproduce this issue. Moving the artificial delay earlier (by gc.collect()) defeats its purpose. Real programs have no mechanism to ensure that Py_FinalizeEx() finishes quickly.

@wjakob
Member

wjakob commented Jul 9, 2020

There are still too many unknowns for me to seriously consider this PR.

What I've understood from this discussion: accessing the GIL when Python has shut down causes it to call pthread_exit() -- this seems very bad to me, but it's in CPython and not something that we can control here.

Then GCC/glibc raises __forced_unwind (a C-style exception?), which C++ seems to understand and be able to catch/propagate (this strikes me as something heavily compiler/platform-dependent). And you want it to be propagated rather than crashing in the destructor, where exceptions are forbidden.

But that is just one very specific platform: what about Clang with libc++/libc++abi (it doesn't even know about abi::__forced_unwind), what about macOS, what about MSVC? How does the unwind happen there?

To be honest, this seems to me like a case where it would be good to take a step back and see whether there isn't perhaps an alternative architecture that avoids these kinds of situations altogether (which reek of undefined behavior even with this hypothetical change).

If you really cannot avoid it at all, then I would suggest that you create your own flavor of gil_scoped_release in PyTorch that prods Python to see if it is still alive before trying to re-acquire the GIL (avoiding the nasty C-style exceptions being thrown from pthread_exit).
