Sporadic segfaults in TensorFlow during PyObject_GC_UnTrack when used through PyO3 #1623

@awestlake87

I've been seeing sporadic segfaults (SIGSEGV) when calling TensorFlow ops through PyO3. I haven't been able to find a reliable pattern for when they occur. As far as I can tell they're pretty random, although sometimes I can find a sweet spot by rearranging or splitting up some calls.

The backtrace consistently starts with the following frames:

#0  0x00007f9d45b5e55f in PyObject_GC_UnTrack () from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0
#1  0x00007f9c9affb8df in EagerTensor_dealloc () from /usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#2  0x00005555b0f8b27f in pyo3::ffi::object::Py_DECREF (op=0x7f9c8dfe3b40) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/ffi/object.rs:825
#3  0x00005555b0f83396 in pyo3::gil::ReferencePool::update_counts (self=0x5555b1a65eb0 <pyo3::gil::POOL>, _py=...)
    at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:357
#4  0x00005555b0f834c9 in pyo3::gil::GILPool::new () at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:386
#5  0x00005555b0f82cff in pyo3::gil::GILGuard::acquire () at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:267
#6  0x00005555b0f83b99 in pyo3::gil::ensure_gil () at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/gil.rs:490
#7  0x00005555b00e1be1 in pyo3::python::Python::with_gil (f=...) at /opt/.cargo/registry/src/github.com-1ecc6299db9ec823/pyo3-0.13.2/src/python.rs:157
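
My reading of frames #2 through #5 is that the decrement is not happening where the tensor was dropped: it looks like a reference released earlier was queued and is only applied when the GIL is next acquired. Below is a rough, std-only sketch of that pattern as I understand it. It is a simplified model, not PyO3's actual code: `FakeObject`, `ReferencePoolModel`, and the index-based "heap" are made up for illustration, and only the name `update_counts` is borrowed from the backtrace.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for a Python object: a refcount plus a flag that
/// records whether its "tp_dealloc" has run.
#[derive(Debug)]
struct FakeObject {
    refcnt: usize,
    deallocated: bool,
}

/// Simplified model of a pool that defers decrements until the GIL is held.
/// Objects are identified by their index in a slice instead of by pointer.
struct ReferencePoolModel {
    pending_decrefs: Mutex<Vec<usize>>,
}

impl ReferencePoolModel {
    fn new() -> Self {
        ReferencePoolModel {
            pending_decrefs: Mutex::new(Vec::new()),
        }
    }

    /// Models dropping an owned reference while the GIL is NOT held:
    /// nothing is decremented yet, the object is only queued.
    fn register_decref(&self, obj_index: usize) {
        self.pending_decrefs.lock().unwrap().push(obj_index);
    }

    /// Models what happens on the next GIL acquisition: every queued
    /// decrement is applied, and objects that reach zero are deallocated.
    /// Returns the indices of the objects deallocated by this call.
    fn update_counts(&self, heap: &mut [FakeObject]) -> Vec<usize> {
        // Take the queue out first so the lock is not held while "dealloc" runs.
        let mut guard = self.pending_decrefs.lock().unwrap();
        let pending = std::mem::take(&mut *guard);
        drop(guard);

        let mut deallocated = Vec::new();
        for idx in pending {
            let obj = &mut heap[idx];
            obj.refcnt -= 1;
            if obj.refcnt == 0 {
                // In the real crash this is where EagerTensor_dealloc runs.
                obj.deallocated = true;
                deallocated.push(idx);
            }
        }
        deallocated
    }
}
```

If this model is roughly right, the deallocation runs at a different time than the drop that triggered it, and possibly on a different thread, which might be relevant to why the crashes look random.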

For reference, that EagerTensor_dealloc function is defined here:

// tp_dealloc for EagerTensor.
void EagerTensor_dealloc(EagerTensor* self) {
  // Unhook the object from python's GC so that the weakref deleter doesn't
  // try to re-delete this.
  PyObject_GC_UnTrack((PyObject*)self);

  // Clear weak references to self.
  // Needs to happen before any actual destruction.
  PyObject_ClearWeakRefs((PyObject*)self);

  Py_DECREF(self->handle_data);
  Py_DECREF(self->tensor_shape);
  // If an attribute dictionary has been created, release it. Note that this
  // is only ever created by CPython's attribute setting methods; we don't
  // create it ourselves.
  Py_CLEAR(self->dict);
  if (self->handle != nullptr) {
    TFE_DeleteTensorHandle(self->handle);
    self->handle = nullptr;
  }

  // Decref context after deleting the tensor handle.
  Py_XDECREF(self->context);

  // We have the global interpreter lock, so use this chance to perform delayed
  // refcount decrements.
  tensorflow::ClearDecrefCache();
  auto id = self->id;
  Py_TYPE(self)->tp_free(self);
  TFE_Py_TapeSetDeleteTrace(id);
}

This may not be the right spot to file this issue, so don't feel obligated to help with this if it doesn't seem related to PyO3. I just thought I'd check with you guys to see if these snippets raise any red flags.

Environment
Docker image: nvidia/cuda:11.0-base-ubuntu20.04
Python 3.8.5
PyO3 v0.13.2
