Blocking destructors and the GIL #1446
Deadlock issues can get tricky fast; keeping the responsibility for sorting out the locking with the user writing the binding code seems less messy than an approach that works sometimes but can fail badly at other times.
In your particular example, the deadlock here is coming from trying to [...]
For the case where your class is pybind11-aware (which it will be if it contains [...]
For binding an external class where intrusion isn't an option, an alternative that ought to work is to use a custom deleter that deletes with the GIL released. This is basically a user-side implementation of your [...]
// Version for shared_ptr:
template <typename T> void destroy_without_gil(T *ptr) {
pybind11::gil_scoped_release nogil;
delete ptr;
}
// version for unique_ptr:
template <typename T> struct unique_ptr_nogil_deleter {
void operator()(T *ptr) {
pybind11::gil_scoped_release nogil;
delete ptr;
}
};
struct UniqueWorker : Worker {}; // Same thing, distinct type for the sake of the example
// ...
// in binding code:
PYBIND11_EMBEDDED_MODULE(deadlock, mod) {
pybind11::class_<Worker, std::shared_ptr<Worker>>(mod, "Worker");
// Alternative using a unique_ptr holder:
pybind11::class_<UniqueWorker, std::unique_ptr<UniqueWorker, unique_ptr_nogil_deleter<UniqueWorker>>>(mod, "UniqueWorker");
}
// ...
// in later code, e.g. in your `main()`:
dict["worker"] = pybind11::cast(std::shared_ptr<Worker>(new Worker(), destroy_without_gil<Worker>));
dict["unique"] = pybind11::cast(new UniqueWorker(), pybind11::return_value_policy::take_ownership);
And here's a complete modified example with all three approaches in use (plus some other modifications to see the precise order and GIL state of destructions):
#include <pybind11/pybind11.h>
#include <pybind11/embed.h>
#include <atomic>
#include <chrono>
#include <thread>
#include <iostream>
#include <string>
using namespace std::chrono_literals;
struct Subobject {
std::string parent;
Subobject(const std::string &parent) : parent(parent) {}
~Subobject() {
std::cout << "Subobject of " << parent << " destroyed with GIL " << (PyGILState_Check() ? "held" : "released") << "\n";
}
};
// A worker that runs some Python code in a separate thread
struct Worker {
Worker(std::string n) : name(std::move(n)), subobject(name) {
thread = std::thread([this] {
while (keepRunning) {
pybind11::gil_scoped_acquire gil;
pybind11::print(this->name + " working");
std::this_thread::sleep_for(10ms);
}
});
}
~Worker() {
if (PyGILState_Check()) {
std::cout << name << " worker destroyed with GIL held; releasing it\n";
pybind11::gil_scoped_release nogil;
keepRunning = false;
if (thread.joinable()) {
thread.join();
}
} else {
std::cout << name << " worker destroyed without GIL held\n";
keepRunning = false;
if (thread.joinable()) {
thread.join();
}
}
}
std::thread thread;
std::string name;
std::atomic<bool> keepRunning{true}; // must start out true, otherwise the worker loop reads an uninitialized flag
Subobject subobject;
};
struct WorkerUnique : public Worker { using Worker::Worker; };
template <typename T> void destroy_without_gil(T *ptr) {
pybind11::gil_scoped_release nogil;
delete ptr;
}
template <typename T> struct unique_ptr_nogil_deleter {
void operator()(T *ptr) { destroy_without_gil(ptr); }
};
PYBIND11_EMBEDDED_MODULE(deadlock, mod) {
pybind11::class_<Worker, std::shared_ptr<Worker>>(mod, "Worker");
pybind11::class_<WorkerUnique, std::unique_ptr<WorkerUnique, unique_ptr_nogil_deleter<WorkerUnique>>>(mod, "WorkerUnique");
}
int main() {
pybind11::scoped_interpreter interpreter;
pybind11::module::import("deadlock");
{
pybind11::dict dict;
dict["worker"] = pybind11::cast(std::shared_ptr<Worker>(new Worker("shared_ptr_no_gil"), destroy_without_gil<Worker>));
dict["worker2"] = pybind11::cast(std::shared_ptr<Worker>(new Worker("shared_ptr")));
dict["worker_unique"] = pybind11::cast(new WorkerUnique("unique_ptr"), pybind11::return_value_policy::take_ownership);
{
// Let the worker run for a while
pybind11::gil_scoped_release release;
std::this_thread::sleep_for(100ms);
}
}
// This line will rarely be reached due to a deadlock when destroying dict
pybind11::print("No deadlock");
}
Output (notice, in particular, that the intrusive solution destroys the subobject with the GIL held, while the others don't):
Thanks for the suggestion of using custom deleters on the smart pointers, I hadn't thought of that. However, in my case, some objects are already wrapped in shared_ptr by the non-Python-aware code, so there's really no opportunity to set the deleter without intrusive changes.
Perhaps you could make your own wrapper for the shared_ptr that releases the GIL and the shared_ptr during destruction, something like:
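A minimal sketch of this re-wrapping idea (not the commenter's original code), written without pybind11 so that it is self-contained. `gil_held` and `ScopedRelease` are hypothetical stand-ins for the real GIL state and for `pybind11::gil_scoped_release`, and `rewrap_nogil`, `Probe`, and `destroyed_with_gil` are illustrative names. The new `shared_ptr` owns a copy of the existing one and drops that copy with the simulated GIL released.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the GIL state of the current thread.
thread_local bool gil_held = true;

// Stand-in for pybind11::gil_scoped_release: releases on entry, reacquires on exit.
struct ScopedRelease {
    ScopedRelease() { assert(gil_held); gil_held = false; }
    ~ScopedRelease() { gil_held = true; }
};

// Wraps an already existing shared_ptr in a second one. The deleter of the new
// pointer keeps a copy of the original and resets it with the "GIL" released.
template <typename T>
std::shared_ptr<T> rewrap_nogil(std::shared_ptr<T> original) {
    T *raw = original.get();
    return std::shared_ptr<T>(raw, [original](T *) mutable {
        ScopedRelease nogil;
        original.reset();  // may run ~T() if this was the last owner
    });
}

// Records the simulated GIL state seen inside its destructor.
struct Probe {
    bool *gil_seen;
    explicit Probe(bool *out) : gil_seen(out) {}
    ~Probe() { *gil_seen = gil_held; }
};

// Returns the GIL state that ~Probe() observed.
bool destroyed_with_gil(bool rewrap) {
    bool seen = true;
    {
        std::shared_ptr<Probe> from_cpp = std::make_shared<Probe>(&seen);
        std::shared_ptr<Probe> for_python = rewrap ? rewrap_nogil(from_cpp) : from_cpp;
        from_cpp.reset();  // the C++ side drops its reference first
    }                      // for_python is the last owner and goes away here
    return seen;
}
```

The wrapper only controls what happens when the re-wrapped handle is the last owner. If some other C++ owner outlives it, the object is destroyed wherever that owner lets go, in whatever GIL state applies there.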
The ThreadPool owned by DNP3Manager causes a deadlock when it is destroyed without unlocking the GIL. By returning the DNP3Manager with a custom deleter which unlocks the GIL during destruction the pool can be deleted. Solution borrowed from: pybind/pybind11#1446
@patstew Does your code actually work as written?
I posted a complete, working holder type example at #2957
It looks like pybind11 doesn't guarantee that holder types are only destructed in contexts where the GIL is held, so you also have to check if you hold the GIL or not in the destructor. My modified code:
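The kind of check described here can be sketched as follows. This is not the commenter's code; it avoids pybind11 so that it stands alone. `gil_held` is a hypothetical stand-in for what `PyGILState_Check()` would report, and `ReleaseIfHeld`, `GuardedHolder`, `Probe`, and `holder_restores_state` are illustrative names. The guard remembers whether the lock was held on entry and restores exactly that state on exit.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for what PyGILState_Check() reports on this thread.
thread_local bool gil_held = false;

// Releases the simulated GIL only if it is currently held, and puts the
// previous state back when the guard goes out of scope.
class ReleaseIfHeld {
    bool was_held_;
public:
    ReleaseIfHeld() : was_held_(gil_held) { gil_held = false; }
    ~ReleaseIfHeld() { gil_held = was_held_; }
    ReleaseIfHeld(const ReleaseIfHeld &) = delete;
    ReleaseIfHeld &operator=(const ReleaseIfHeld &) = delete;
};

// Holder that drops its reference without the "GIL", whatever the entry state.
template <typename T>
class GuardedHolder {
    std::shared_ptr<T> ptr_;
public:
    explicit GuardedHolder(std::shared_ptr<T> p) : ptr_(std::move(p)) {}
    ~GuardedHolder() {
        ReleaseIfHeld nogil;
        ptr_.reset();
    }
    T *get() const noexcept { return ptr_.get(); }
};

// Records the simulated GIL state seen inside its destructor.
struct Probe {
    bool *gil_seen;
    explicit Probe(bool *out) : gil_seen(out) {}
    ~Probe() { *gil_seen = gil_held; }
};

// Destroys a holder with the given entry state. Returns true if the object was
// destroyed without the "GIL" and the entry state was restored afterwards.
bool holder_restores_state(bool held_on_entry) {
    gil_held = held_on_entry;
    bool seen = true;
    {
        GuardedHolder<Probe> holder(std::make_shared<Probe>(&seen));
    }
    return !seen && gil_held == held_on_entry;
}
```

With real pybind11 types the same shape would presumably be a `PyGILState_Check()` test around a `pybind11::gil_scoped_release`, since constructing that guard on a thread that does not hold the GIL is not safe.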
@bkloster-sma's comment raises a second problem with the holder approach: even if you do have Python bindings for which you can set a custom deleter for the shared pointer, if there are other sites in your code where the object can be allocated without that deleter, you could still end up passing the object to Python and having Python be the last owner of the object in question (bypassing the deleter). (This can't happen for a unique pointer because the deleter is bound up in the type.) This should be pretty rare, hopefully, but it annoyingly suggests the only way to be totally sure is to release the GIL in the destructor of the C++ class itself, or to have a specialized holder type for pybind11 which ensures that the owning reference has an appropriate deleter...
### Details:
- Initial problem: `test_custom_op` hung on destruction because it was waiting for a thread that was trying to acquire the GIL.
- The second problem is that pybind11 doesn't allow managing the GIL outside the current scope, so it's impossible to release the GIL for destructors. pybind/pybind11#1446
- The current solution releases the GIL for InferRequest and for all destructors called in its chain.
### Tickets:
- CVS-141744
@rwgk is there any reason why your changes couldn't be merged back into pybind11?
This one?
Correct. It looks like you merged it into the google fork but not into vanilla pybind11.
Non-technical reasons, mostly very/extremely slow reviews in general. I'll try to port it to master here soon and I'll tag you there for a review.
Thank you. The least intrusive workaround I have found is a custom holder that wraps the desired smart pointer like patstew outlined above, but their example has a few bugs.
#pragma once
#include <pybind11/pybind11.h>
#include <memory>
#include <utility>
// A custom shared ptr that releases the GIL before freeing the resource.
template <typename T>
class nogil_shared_ptr {
private:
std::shared_ptr<T> ptr;
public:
template <typename... Args>
nogil_shared_ptr(Args&&... args)
: ptr(std::forward<Args>(args)...)
{
}
~nogil_shared_ptr()
{
pybind11::gil_scoped_release nogil;
ptr.reset();
}
T& operator*() const noexcept { return *ptr; }
T* operator->() const noexcept { return ptr.get(); }
operator std::shared_ptr<T>() const noexcept { return ptr; }
T* get() const noexcept { return ptr.get(); }
};
PYBIND11_DECLARE_HOLDER_TYPE(T, nogil_shared_ptr<T>)
That's all I'll need, thanks. That particular PR was reviewed by Google engineers and it was in Google's production software stack. It passed millions of unit tests through PyCLIF-pybind11 testing.
Issue description
pybind11 allows releasing the GIL for pretty much any bound function, including constructors, but not for destructors. Besides a missed opportunity for optimizing GIL usage, this can easily cause deadlocks in certain situations. Whenever a destructor waits for another thread, and this thread tries to lock the GIL (because it needs to run Python code, or otherwise wants to work with Python objects), a deadlock occurs.
The sample program at the bottom demonstrates this problem. Destroying the dictionary triggers the destructor of the `Worker`, causing a deadlock more often than not. Obviously, `~Worker()` does not have to keep the GIL locked, and explicitly releasing it before calling `join()` will resolve the deadlock. However, this is not always a desirable solution, because it means inserting Python calls invasively into a codebase (basically into any destructor that may block).
Are there any agreed upon strategies to deal with this problem?
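The mechanism can be reproduced without Python at all. In the sketch below, `fake_gil` is a hypothetical stand-in for the GIL (a timed mutex, so the example gives up instead of hanging), and `worker_makes_progress` is an illustrative name. It models a destructor that is entered with the lock held and then waits for a worker that needs the same lock.

```cpp
#include <chrono>
#include <future>
#include <mutex>

// Hypothetical stand-in for the GIL: a timed mutex, so the sketch can give up
// instead of blocking forever.
std::timed_mutex fake_gil;

// Models a destructor that is entered with the "GIL" held and then waits for a
// worker that needs the "GIL". Returns true if the worker obtained the lock.
bool worker_makes_progress(bool release_before_waiting) {
    std::unique_lock<std::timed_mutex> gil(fake_gil);  // entered with the lock held

    std::future<bool> worker = std::async(std::launch::async, []() -> bool {
        // A real worker would block indefinitely; this one gives up after 200 ms.
        if (!fake_gil.try_lock_for(std::chrono::milliseconds(200))) {
            return false;
        }
        fake_gil.unlock();
        return true;
    });

    if (release_before_waiting) {
        gil.unlock();  // plays the role of pybind11::gil_scoped_release
    }
    bool progressed = worker.get();  // plays the role of thread.join()
    if (release_before_waiting) {
        gil.lock();  // reacquire before returning to the caller
    }
    return progressed;
}
```

Calling `worker_makes_progress(false)` corresponds to the deadlocking case: the worker times out because the waiting side never lets go of the lock. Calling it with `true` corresponds to releasing the GIL before `join()`.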
Possible solutions
If there isn't a common solution to this deadlock, I would like to propose a couple of options.
delete_without_gil
Add a new option to the `class_` template, `delete_without_gil`. While deallocating objects of such classes, pybind11 will release the GIL.
[EDIT 2024-01-18: This was implemented under https://github.com/google/pybind11clif/pull/30088]
This is a straightforward, but not a complete solution. The "blocking" property of destructors is transitive through the class's members. When pybind11 destroys an object of type `A`, but this object has a member of type `B` whose destructor blocks, `A` also has to be marked `delete_without_gil`. What's worse, if `~B()` originally starts out as non-blocking, but is later changed to be blocking, all classes that have a `B` member need to retroactively be marked `delete_without_gil`. Not to mention the case where `B` is polymorphic, and someone unwittingly implements a new subclass with a blocking destructor.
In short, bindings for complex codebases may need to always specify `delete_without_gil` to be on the safe side.
[EDIT 2024-01-18: This is exactly how PyCLIF works. The new PyCLIF-pybind11 version will have the same behavior.]
Always release the GIL during deallocation
This would prevent the deadlock pretty decisively, but objects holding Python objects (e.g. `pybind11::dict`) as members will have to take care to reacquire the GIL before destroying them. Furthermore, the GIL may thrash during destruction of a complex object hierarchy, introducing a performance penalty.
It may be prudent to allow toggling this option through a preprocessor flag. Bindings that require it and can live with the additional GIL overhead can enable it, while simpler modules can leave it as is.
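As a rough illustration of this option, the sketch below models the deallocation path with a compile-time switch. Nothing here is pybind11 API: `SKETCH_RELEASE_GIL_ON_DEALLOC` is a hypothetical flag name, `gil_held`, `ScopedRelease`, and `ScopedAcquire` are stand-ins for the real GIL and its RAII guards, and `FakePyObject` stands in for a member such as `pybind11::dict`.

```cpp
#include <memory>

// Hypothetical compile-time switch; bindings that need it would define it as 1.
#ifndef SKETCH_RELEASE_GIL_ON_DEALLOC
#define SKETCH_RELEASE_GIL_ON_DEALLOC 1
#endif

// Hypothetical stand-in for the GIL state of the current thread.
thread_local bool gil_held = true;

// Stand-ins for gil_scoped_release and gil_scoped_acquire.
struct ScopedRelease {
    bool previous;
    ScopedRelease() : previous(gil_held) { gil_held = false; }
    ~ScopedRelease() { gil_held = previous; }
};
struct ScopedAcquire {
    bool previous;
    ScopedAcquire() : previous(gil_held) { gil_held = true; }
    ~ScopedAcquire() { gil_held = previous; }
};

// Models a Python-object member; it must be destroyed with the GIL held.
struct FakePyObject {
    bool *destroyed_with_gil;
    explicit FakePyObject(bool *out) : destroyed_with_gil(out) {}
    ~FakePyObject() { *destroyed_with_gil = gil_held; }
};

// A bound class that owns a Python object and therefore reacquires the GIL
// before letting go of it.
struct Bound {
    std::unique_ptr<FakePyObject> py_member;
    explicit Bound(bool *out) : py_member(new FakePyObject(out)) {}
    ~Bound() {
        ScopedAcquire gil;
        py_member.reset();
    }
};

// Models the deallocation path of the binding layer.
template <typename T>
void dealloc(T *ptr) {
#if SKETCH_RELEASE_GIL_ON_DEALLOC
    ScopedRelease nogil;  // every destructor starts without the GIL
#endif
    delete ptr;
}

// Returns true if the Python member was destroyed with the "GIL" held and the
// caller's state was restored afterwards.
bool python_member_destroyed_with_gil() {
    bool with_gil = false;
    dealloc(new Bound(&with_gil));
    return with_gil && gil_held;
}
```

The release followed by the reacquire inside `~Bound()` is the thrashing mentioned above: each object with a Python member pays for one extra pair of GIL transitions when the flag is enabled.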
Reproducible example code
This sample will start a worker executing some Python code (simple print statements) in a separate thread, which it needs the GIL for. Upon destruction of the worker, the thread is joined. If, as is the case here, the worker is destroyed while the GIL is locked, a deadlock occurs.