Skip to content

Blocking destructors and the GIL #1446

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
bkloster-sma opened this issue Jul 5, 2018 · 13 comments · Fixed by #5522
Closed

Blocking destructors and the GIL #1446

bkloster-sma opened this issue Jul 5, 2018 · 13 comments · Fixed by #5522

Comments

@bkloster-sma
Copy link

bkloster-sma commented Jul 5, 2018

Issue description

pybind11 allows releasing the GIL for pretty much any bound function, including constructors, but not for destructors. Besides a missed opportunity for optimizing GIL usage, this can easily cause deadlocks in certain situations. Whenever a destructor waits for another thread, and this thread tries to lock the GIL (because it needs to run Python code, or otherwise wants to work with Python objects), a deadlock occurs.

The sample program at the bottom demonstrates this problem. Destroying the dictionary triggers the destructor of the Worker, causing a deadlock more often than not. Obviously, ~Worker() does not have to keep the GIL locked, and explicitly releasing it before calling join() will resolve the deadlock. However, this is not always a desirable solution, because it means inserting Python calls invasively into a codebase (basically into any destructor that may block).

Are there any agreed upon strategies to deal with this problem?

Possible solutions

If there isn't a common solution to this deadlock, I would like to propose a couple of options.

delete_without_gil

Add a new option to the class_ template, delete_without_gil. While deallocating objects of such classes, pybind11 will release the GIL.

[EDIT 2024-01-18: This was implemented under https://github.com/google/pybind11clif/pull/30088]

This is a straight-forward, but not a complete solution. The "blocking" property of destructors is transitive through the class' members. When pybind11 destroys an object of type A, but this object has a member of type B whose destructor blocks, A also has to be marked delete_without_gil. What's worse, if ~B() originally starts out as non-blocking, but is later changed to be blocking, all classes that have a B member need to retroactively be marked delete_without_gil. Not to mention the case where B is polymorph, and someone unwittingly implements a new subclass with a blocking destructor.

In short, bindings for complex codebases may need to always specify delete_without_gil to be on the safe side.

[EDIT 2024-01-18: This is exactly how PyCLIF works. The new PyCLIF-pybind11 version will have the same behavior.]

Always release the GIL during deallocation

This would prevent the deadlock pretty decisively, but objects holding Python objects (e.g. pybind11::dict) as members will have to take care to reacquire the GIL before destroying them. Furthermore, the GIL may thrash during destruction of a complex object hierarchy, introducing a performance penalty.

It may be prudent to allow toggling this option through a preprocessor flag. Bindings that require it and can live with the additional GIL overhead can enable it, while simpler modules can leave it as is.

Reproducible example code

This sample will start a worker executing some Python code (simple print statements) in a separate thread, which it needs the GIL for. Upon destruction of the worker, the thread is joined. If, as is the case here, the worker is destroyed while the GIL is locked, a deadlock occurs.

#include <pybind11/pybind11.h>
#include <pybind11/embed.h>

#include <atomic>
#include <thread>

using namespace std::chrono_literals;

// A worker that runs some Python code in a separate thread
struct Worker {

    Worker() {
        thread = std::thread([this] {
            while (keepRunning) {
                pybind11::gil_scoped_acquire gil;
                pybind11::print("Working");
                std::this_thread::sleep_for(10ms);
            }
        });
    }

    ~Worker() {
        keepRunning = false;
        if (thread.joinable()) {
            thread.join();
        }
    }

    std::thread thread;

    std::atomic<bool> keepRunning;

};

PYBIND11_EMBEDDED_MODULE(deadlock, mod) {
    pybind11::class_<Worker>(mod, "Worker");
}

int main() {
    pybind11::scoped_interpreter interpreter;
    pybind11::module::import("deadlock");
    {
        pybind11::dict dict;
        dict["worker"] = pybind11::cast(new Worker(), pybind11::return_value_policy::take_ownership);
        {
            // Let the worker run for a while
            pybind11::gil_scoped_release release;
            std::this_thread::sleep_for(100ms);
        }
    }
    // This line will rarely be reached due to a deadlock when destroying dict
    pybind11::print("No deadlock");
}

@jagerman
Copy link
Member

Deadlock issues can get tricky fast; keeping the responsibility for sorting the locking on the part of the user doing the binding code seems less messy than an approach that works sometimes but can fail badly at other times.

In your particular example, the deadlock here is coming from trying to join() in the destructor; a fix for the example would be to throw a py::gil_scoped_release nogil; just before thread.join();.

For the case where your class is pybind11-aware (which is will be if it contains py::dict members or other classes that ultimately touch Python during destruction) this seems fine: the class is already pybind11-aware, and moreover this releases only in the destructor itself (but not over the entire destruction): destruction of subobjects will be done with the gil held — this is really the optimal solution, but is intrusive.

For binding an external class where intrusion isn't an option, an alternative that ought to work is to use a custom deleter that deletes with the gil released. This is basically a user-side implementation of your delete_without_gil approach without actually needing to add it into pybind11 itself, but has the same caveats (i.e. if destruction is going to hit Python code either directly or indirect you're in trouble). Here are a couple of implementations using a shared_ptr and unique_ptr with custom deleters:

// Version for shared_ptr:
template <typename T> void destroy_without_gil(T *ptr) {
    pybind11::gil_scoped_release nogil;
    delete ptr;
}

// version for unique_ptr:
template <typename T> struct unique_ptr_nogil_deleter {
    void operator()(T *ptr) {
        pybind11::gil_scoped_release nogil;
        delete ptr;
    }
};

struct UniqueWorker : Worker {}; // Same thing, distinct type for the sake of the example

// ...
// in binding code:
PYBIND11_EMBEDDED_MODULE(deadlock, mod) {
    pybind11::class_<Worker, std::shared_ptr<Worker>>(mod, "Worker");
    // Alternative using a unique_ptr holder:
    pybind11::class_<UniqueWorker, std::unique_ptr<UniqueWorker, unique_ptr_nogil_deleter<UniqueWorker>>>(mod, "UniqueWorker");
}

// ...
// in later code, e.g. in your `main()`:
    dict["worker"] = pybind11::cast(std::shared_ptr<Worker>(new Worker(), destroy_without_gil<Worker>));
    dict["unique"] = pybind11::cast(new UniqueWorker(), pybind11::return_value_policy::take_ownership);

And here's a complete modified example with all three approaches in use (plus some other modifications to see the precise order and GIL state of destructions):

#include <pybind11/pybind11.h>
#include <pybind11/embed.h>

#include <atomic>
#include <thread>
#include <iostream>

using namespace std::chrono_literals;

struct Subobject {
    std::string parent;
    Subobject(const std::string &parent) : parent(parent) {}
    ~Subobject() {
        std::cout << "Subobject of " << parent << " destroyed with GIL " << (PyGILState_Check() ? "held" : "released") << "\n";
    }
};


// A worker that runs some Python code in a separate thread
struct Worker {

    Worker(std::string n) : name(std::move(n)), subobject(name) {
        thread = std::thread([this] {
            while (keepRunning) {
                pybind11::gil_scoped_acquire gil;
                pybind11::print(this->name + " working");
                std::this_thread::sleep_for(10ms);
            }
        });
    }

    ~Worker() {
        if (PyGILState_Check()) {
            std::cout << name << " worker destroyed with GIL held; releasing it\n";
            pybind11::gil_scoped_release nogil;
            keepRunning = false;
            if (thread.joinable()) {
                thread.join();
            }
        } else {
            std::cout << name << " worker destroyed without GIL held\n";
            keepRunning = false;
            if (thread.joinable()) {
                thread.join();
            }
        }
    }

    std::thread thread;
    std::string name;
    std::atomic<bool> keepRunning;

    Subobject subobject;
};

struct WorkerUnique : public Worker { using Worker::Worker; };

template <typename T> void destroy_without_gil(T *ptr) {
    pybind11::gil_scoped_release nogil;
    delete ptr;
}

template <typename T> struct unique_ptr_nogil_deleter {
    void operator()(T *ptr) { destroy_without_gil(ptr); }
};


PYBIND11_EMBEDDED_MODULE(deadlock, mod) {
    pybind11::class_<Worker, std::shared_ptr<Worker>>(mod, "Worker");
    pybind11::class_<WorkerUnique, std::unique_ptr<WorkerUnique, unique_ptr_nogil_deleter<WorkerUnique>>>(mod, "WorkerUnique");
}

int main() {
    pybind11::scoped_interpreter interpreter;
    pybind11::module::import("deadlock");
    {
        pybind11::dict dict;
        dict["worker"] = pybind11::cast(std::shared_ptr<Worker>(new Worker("shared_ptr_no_gil"), destroy_without_gil<Worker>));
        dict["worker2"] = pybind11::cast(std::shared_ptr<Worker>(new Worker("shared_ptr")));
        dict["worker_unique"] = pybind11::cast(new WorkerUnique("unique_ptr"), pybind11::return_value_policy::take_ownership);
        {
            // Let the worker run for a while
            pybind11::gil_scoped_release release;
            std::this_thread::sleep_for(100ms);
        }
    }
    // This line will rarely be reached due to a deadlock when destroying dict
    pybind11::print("No deadlock");
}

output (notice, in particular, that the intrusive solution destroys the subobject with the GIL held, while the others don't):

shared_ptr_no_gil working
unique_ptr working
shared_ptr working
unique_ptr working
shared_ptr_no_gil working
unique_ptr working
shared_ptr working
unique_ptr working
shared_ptr working
shared_ptr_no_gil working
shared_ptr_no_gil worker destroyed without GIL held
unique_ptr working
shared_ptr_no_gil working
unique_ptr working
Subobject of shared_ptr_no_gil destroyed with GIL released
shared_ptr worker destroyed with GIL held; releasing it
shared_ptr working
unique_ptr working
unique_ptr working
Subobject of shared_ptr destroyed with GIL held
unique_ptr worker destroyed without GIL held
Subobject of unique_ptr destroyed with GIL released
No deadlock

@bkloster-sma
Copy link
Author

Thanks for the suggestion of using custom deleters on the smart pointers, I hadn't thought of that. However, in my case, some objects are already wrapped in shared_ptr by the non-Python-aware code, so there's really no opportunity to set the deleter without intrusive changes.

@patstew
Copy link
Contributor

patstew commented Aug 9, 2018

Perhaps you could make your own wrapper for the shared_ptr that releases the GIL and the shared_ptr during destruction, something like:

class Object;
shared_ptr<Object> unchangeable_func();

template<class T>
class Wrapper {
    shared_ptr<T> p;
    Wrapper(shared_ptr<T> a) : p(std::move(a)) {}
    ~Wrapper() {
        pybind11::gil_scoped_release release;
        p.reset();
    }
    T& operator*() { return *p; }
    T* operator->() { return p.get(); }
    operator shared_ptr<Object>() { return p; }
}

....

    pybind11::class_<Object, Wrapper<Object>>(mod, "Object");
    mod.def("unchangeable_func", [](){return Wrapper<Object>(unchangeable_func());});

shawgree pushed a commit to shawgree/pydnp3 that referenced this issue Oct 16, 2020
The ThreadPool owned by DNP3Manager causes a deadlock when it is destroyed without unlocking the GIL.
By returning the DNP3Manager with a custom deleter which unlocks the GIL during destruction the pool can be deleted.
Solution borrowed from: pybind/pybind11#1446
@ezyang
Copy link

ezyang commented Apr 16, 2021

@patstew Does your code actually work as written? class_ should reject Wrapper unless it is properly defined as PYBIND11_DECLARE_HOLDER_TYPE

@ezyang
Copy link

ezyang commented Apr 19, 2021

I posted a complete, working holder type example at #2957

@ezyang
Copy link

ezyang commented Apr 19, 2021

It looks like pybind11 doesn't guarantee that holder types are only destructed in contexts where the GIL is held, so you also have to check if you hold the GIL or not in the destructor. My modified code:

template <typename T>
class IntrusivePtrNoGilDestructor {
  c10::intrusive_ptr<T> impl_;
public:
  IntrusivePtrNoGilDestructor() = default;
  IntrusivePtrNoGilDestructor(const IntrusivePtrNoGilDestructor&) = default;
  IntrusivePtrNoGilDestructor(IntrusivePtrNoGilDestructor&&) = default;
  IntrusivePtrNoGilDestructor& operator=(const IntrusivePtrNoGilDestructor&) = default;
  IntrusivePtrNoGilDestructor& operator=(IntrusivePtrNoGilDestructor&&) = default;
  IntrusivePtrNoGilDestructor(c10::intrusive_ptr<T> impl) : impl_(std::move(impl)) {}
  // This ctor is very important; see
  // https://github.com/pybind/pybind11/issues/2957
  IntrusivePtrNoGilDestructor(T* impl) : impl_(c10::intrusive_ptr<T>::reclaim(impl)) {}
  ~IntrusivePtrNoGilDestructor() {
    if (impl_) {
      if (PyGILState_Check()) {
        pybind11::gil_scoped_release release;
        impl_.reset();
      } else {
        impl_.reset();
      }
    }
  }
  T& operator*() const noexcept { return *impl_; }
  T* operator->() const noexcept { return impl_.get(); }
  T* get() const noexcept { return impl_.get(); }
  void reset() noexcept { impl_.reset(); }
  operator bool() const noexcept {
    return impl_;
  } 
};
PYBIND11_DECLARE_HOLDER_TYPE(T, IntrusivePtrNoGilDestructor<T>, true);


@ezyang
Copy link

ezyang commented Apr 27, 2021

@bkloster-sma's comment raises a second problem with the holder approach: even if you do have Python bindings for which you can set a custom deleter for the shared pointer, if there are other sites in your code where the object can be allocated without the destructor, you could still end up passing the object to Python and having Python be the last owner of the object in question (bypassing the destructor). (This can't happen for unique pointer because the destructor is bound up in the type). This should pretty rare, hopefully, but it annoyingly suggests the only way to be totally sure is release the GIL in the destructor of the C++ class itself, or have a specialized holder type for pybind11 which ensures that the owning reference has an appropriate destructor...

zagor pushed a commit to zagor/dnp3-python that referenced this issue Oct 20, 2023
The ThreadPool owned by DNP3Manager causes a deadlock when it is destroyed without unlocking the GIL.
By returning the DNP3Manager with a custom deleter which unlocks the GIL during destruction the pool can be deleted.
Solution borrowed from: pybind/pybind11#1446
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 4, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 4, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 4, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 6, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 6, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 6, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 6, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 6, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 6, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 6, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 7, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
github-merge-queue bot pushed a commit to openvinotoolkit/openvino that referenced this issue Jun 7, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
alexandruenache1111 pushed a commit to alexandruenache1111/openvino that referenced this issue Jun 13, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
allnes pushed a commit to allnes/openvino that referenced this issue Jun 27, 2024
### Details:
- Initial problem: `test_custom_op` hanged on destruction because it was
waiting for a thread which tried to acquire GIL.
- The second problem is that pybind11 doesn't allow to work with GIL
besides of current scope and it's impossible to release GIL for
destructors. pybind/pybind11#1446
- Current solution allows to release GIL for InferRequest and all called
by chain destructors.

### Tickets:
 - CVS-141744
@gentlegiantJGC
Copy link
Contributor

@rwgk is there any reason why your changes couldn't be merged back into pybind11?
I would quite like this feature.

@rwgk
Copy link
Collaborator

rwgk commented Feb 13, 2025

@rwgk is there any reason why your changes couldn't be merged back into pybind11? I would quite like this feature.

This one?

@gentlegiantJGC
Copy link
Contributor

This one?

Correct. It looks like you merged it into the google fork but not into vanilla pybind11.
I was wondering if there is a reason why it wasn't.

@rwgk
Copy link
Collaborator

rwgk commented Feb 13, 2025

This one?

Correct. It looks like you merged it into the google fork but not into vanilla pybind11. I was wondering if there is a reason why it wasn't.

Non-technical reasons, mostly very/extremely slow reviews in general.

I'll try to port it to master here soon and I'll tag you there for a review.

@gentlegiantJGC
Copy link
Contributor

Thank you.
My knowledge of pybind11's internals is limited so I don't know how much help I can be but the implementation in that pull request looks good.

The least intrusive workaround I have found is a custom holder that wraps the desired smart pointer like patstew outlined above but their example has a few bugs.
Here is my improved version for anyone that wants it.

#pragma once

#include <pybind11/pybind11.h>

#include <memory>

// A custom shared ptr that releases the GIL before freeing the resource.
template <typename T>
class nogil_shared_ptr {
private:
    std::shared_ptr<T> ptr;

public:
    template <typename... Args>
    nogil_shared_ptr(Args&&... args)
        : ptr(std::forward<Args>(args)...)
    {
    }

    ~nogil_shared_ptr()
    {
        pybind11::gil_scoped_release nogil;
        ptr.reset();
    }

    T& operator*() const noexcept { return *ptr; }
    T* operator->() const noexcept { return ptr.get(); }
    operator std::shared_ptr<T>() const noexcept { return ptr; }
    T* get() const noexcept { return ptr.get(); }
};

PYBIND11_DECLARE_HOLDER_TYPE(T, nogil_shared_ptr<T>)

@rwgk
Copy link
Collaborator

rwgk commented Feb 13, 2025

My knowledge of pybind11's internals is limited so I don't know how much help I can be but the implementation in that pull request looks good.

That's all I'll need, thanks. That particular PR was reviewed by Google engineers and it was in Google's production software stack. It passed millions of unit tests through PyCLIF-pybind11 testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants