[mlir][py] invalidate nested operations when parent is deleted #93339

ftynse · 2024-05-24T19:56:25Z

When an operation is erased in Python, its children may still be in the "live" list inside Python bindings. After this, if some of the newly allocated operations happen to reuse the same pointer address, this will trigger an assertion in the bindings. This assertion would be incorrect because the operations aren't actually live. Make sure we remove the children operations from the "live" list when erasing the parent.

This also concentrates responsibility over the removal from the "live" list and invalidation in a single place.

Note that this requires the IR to be sufficiently structurally valid so a walk through it can succeed. If this invariant was broken by, e.g, C++ pass called from Python, there isn't much we can do.

When an operation is erased in Python, its children may still be in the "live" list inside Python bindings. After this, if some of the newly allocated operations happen to reuse the same pointer address, this will trigger an assertion in the bindings. This assertion would be incorrect because the operations aren't actually live. Make sure we remove the children operations form the "live" list when erasing the parent. This also concentrates responsibility over the removal from the "live" list and invalidation in a single place. Note that this requires the IR to be sufficiently structurally valid so a walk through it can succeeed. If this invariant was broken by, e.g, C++ pass called from Python, there isn't much we can do.

llvmbot · 2024-05-24T19:56:54Z

@llvm/pr-subscribers-mlir

Author: Oleksandr "Alex" Zinenko (ftynse)

Changes

When an operation is erased in Python, its children may still be in the "live" list inside Python bindings. After this, if some of the newly allocated operations happen to reuse the same pointer address, this will trigger an assertion in the bindings. This assertion would be incorrect because the operations aren't actually live. Make sure we remove the children operations form the "live" list when erasing the parent.

This also concentrates responsibility over the removal from the "live" list and invalidation in a single place.

Note that this requires the IR to be sufficiently structurally valid so a walk through it can succeeed. If this invariant was broken by, e.g, C++ pass called from Python, there isn't much we can do.

Full diff: https://github.com/llvm/llvm-project/pull/93339.diff

3 Files Affected:

(modified) mlir/lib/Bindings/Python/IRCore.cpp (+22-13)
(modified) mlir/lib/Bindings/Python/IRModule.h (+7)
(added) mlir/test/python/live_operations.py (+46)

diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 01678a9719f90..f03c540d618cd 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -684,6 +684,17 @@ void PyMlirContext::clearOperationsInside(MlirOperation op) {
   clearOperationsInside(opRef->getOperation());
 }
 
+void PyMlirContext::clearOperationAndInside(PyOperationBase &op) {
+  MlirOperationWalkCallback invalidatingCallback = [](MlirOperation op,
+                                                      void *userData) {
+    PyMlirContextRef &contextRef = *static_cast<PyMlirContextRef *>(userData);
+    contextRef->clearOperation(op);
+    return MlirWalkResult::MlirWalkResultAdvance;
+  };
+  mlirOperationWalk(op.getOperation(), invalidatingCallback,
+                    &op.getOperation().getContext(), MlirWalkPreOrder);
+}
+
 size_t PyMlirContext::getLiveModuleCount() { return liveModules.size(); }
 
 pybind11::object PyMlirContext::contextEnter() {
@@ -1112,12 +1123,16 @@ PyOperation::~PyOperation() {
   // If the operation has already been invalidated there is nothing to do.
   if (!valid)
     return;
-  auto &liveOperations = getContext()->liveOperations;
-  assert(liveOperations.count(operation.ptr) == 1 &&
-         "destroying operation not in live map");
-  liveOperations.erase(operation.ptr);
-  if (!isAttached()) {
-    mlirOperationDestroy(operation);
+
+  // Otherwise, invalidate the operation and remove it from live map when it is
+  // attached.
+  if (isAttached()) {
+    getContext()->clearOperation(*this);
+  } else {
+    // And destroy it when it is detached, i.e. owned by Python, in which case
+    // all nested operations must be invalidated at removed from the live map as
+    // well.
+    erase();
   }
 }
 
@@ -1527,14 +1542,8 @@ py::object PyOperation::createOpView() {
 
 void PyOperation::erase() {
   checkValid();
-  // TODO: Fix memory hazards when erasing a tree of operations for which a deep
-  // Python reference to a child operation is live. All children should also
-  // have their `valid` bit set to false.
-  auto &liveOperations = getContext()->liveOperations;
-  if (liveOperations.count(operation.ptr))
-    liveOperations.erase(operation.ptr);
+  getContext()->clearOperationAndInside(*this);
   mlirOperationDestroy(operation);
-  valid = false;
 }
 
 //------------------------------------------------------------------------------
diff --git a/mlir/lib/Bindings/Python/IRModule.h b/mlir/lib/Bindings/Python/IRModule.h
index b038a0c54d29b..8c34c11f70950 100644
--- a/mlir/lib/Bindings/Python/IRModule.h
+++ b/mlir/lib/Bindings/Python/IRModule.h
@@ -218,6 +218,8 @@ class PyMlirContext {
   /// This is useful for when some non-bindings code destroys the operation and
   /// the bindings need to made aware. For example, in the case when pass
   /// manager is run.
+  ///
+  /// Note that this does *NOT* clear the nested operations.
   void clearOperation(MlirOperation op);
 
   /// Clears all operations nested inside the given op using
@@ -225,6 +227,10 @@ class PyMlirContext {
   void clearOperationsInside(PyOperationBase &op);
   void clearOperationsInside(MlirOperation op);
 
+  /// Clears the operaiton _and_ all operations inside using
+  /// `clearOperation(MlirOperation)`.
+  void clearOperationAndInside(PyOperationBase &op);
+
   /// Gets the count of live modules associated with this context.
   /// Used for testing.
   size_t getLiveModuleCount();
@@ -246,6 +252,7 @@ class PyMlirContext {
 
 private:
   PyMlirContext(MlirContext context);
+
   // Interns the mapping of live MlirContext::ptr to PyMlirContext instances,
   // preserving the relationship that an MlirContext maps to a single
   // PyMlirContext wrapper. This could be replaced in the future with an
diff --git a/mlir/test/python/live_operations.py b/mlir/test/python/live_operations.py
new file mode 100644
index 0000000000000..892ed1715f6c7
--- /dev/null
+++ b/mlir/test/python/live_operations.py
@@ -0,0 +1,46 @@
+# RUN: %PYTHON %s
+# It is sufficient that this doesn't assert.
+
+from mlir.ir import *
+
+
+def createDetachedModule():
+    module = Module.create()
+    with InsertionPoint(module.body):
+        # TODO: Python bindings are currently unaware that modules are also
+        # operations, so having a module erased won't trigger the cascading
+        # removal of live operations (#93337). Use a non-module operation
+        # instead.
+        nested = Operation.create("test.some_operation", regions=1)
+
+        # When the operation is detached from parent, it is considered to be
+        # owned by Python. It will therefore be erased when the Python object
+        # is destroyed.
+        nested.detach_from_parent()
+
+        # However, we create and maintain references to operations within
+        # `nested`. These references keep the corresponding operations in the
+        # "live" list even if they have been erased in C++, making them
+        # "zombie". If the C++ allocator reuses one of the address previously
+        # used for a now-"zombie" operation, this used to result in an
+        # assertion "cannot create detached operation that already exists" from
+        # the bindings code. Erasing the detached operation should result in
+        # removing all nested operations from the live list.
+        #
+        # Note that the assertion is not guaranteed since it depends on the
+        # behavior of the allocator on the C++ side, so this test mail fail
+        # intermittently.
+        with InsertionPoint(nested.regions[0].blocks.append()):
+            a = [Operation.create("test.some_other_operation") for i in range(100)]
+    return a
+
+
+def createManyDetachedModules():
+    with Context() as ctx, Location.unknown():
+        ctx.allow_unregistered_dialects = True
+        for j in range(100):
+            a = createDetachedModule()
+
+
+if __name__ == "__main__":
+    createManyDetachedModules()

stellaraccident · 2024-05-24T20:07:00Z

How do you want to do this? Land this or see if we can get the live operations excision patch landed?

makslevental · 2024-05-24T20:11:47Z

The patch in question #92631

ftynse · 2024-05-24T21:09:26Z

I haven't entirely followed the other discussion. What's the plan there, is there consensus to remove the live operations mechanism entirely? I'm not sure whether going from "access to IR objects won't crash (best effort)" to "access to IR object will almost always crash" is a UX improvement.

stellaraccident · 2024-05-24T21:43:28Z

It makes them more like attributes and types where we return a python wrapper backed by an operation* that has a lifetime for however long you hold the wrapper. It is up to the user to not delete IR structure that have live references in their code, and there is no secret map of dangling pointers.

We had discussed also adding some kind of arena context manager that would hold all PyOperations allocated in its scope and allow you to invalidate them (or invalidate at scope exit).

There really isn't a better way in the context of what MLIR is. If we went back to something super heavy handed, that would probably push me to maintain out of tree bindings optimized for efficiency of IR construction and constrained traversal, similar to the lite LLVM python bindings that are used for production cases where efficiency matters.

I was going to patch the excision patch into some projects and see how bad it is. I think the potential for dangling pointers that are induced by the bindings itself is the worst world and we need to exit that.

makslevental · 2024-05-24T21:45:16Z

I'm not sure whether going from "access to IR objects won't crash (best effort)" to "access to IR object will almost always crash" is a UX improvement.

Hey I've been told multiple times that the bindings should have no more QoL features than the C++ codebase itself 😛.

I kid. It's true and I don't see a way out there - I was hoping Stella would come back with some ultra-wise/clever solution.

stellaraccident · 2024-05-24T21:49:20Z

In this case, unless if you can hook every way in c++ that an operation* can be invalidated or have some form of "weak opptr" semantics, there is no better way. As we've seen, the attempt to keep a parallel accounting itself is its own enemy due to pointer reuse.

We could try some form of weak reference list to the PyOperation itself (ie. As in the arena thought experiment), but my guess is that the best use of that would be for some kind of safe mode where we could actively warn or except if mutations we know about are dangling. Without an explicit arena, though, it will be very noisy because the python GC is not precise.

stellaraccident · 2024-05-24T21:56:53Z

If there was an extra int slot on every operation, you could manage something by assigning a non reusable identity value to any operation with outside bindings. But there is not such a slot, and I don't think we would just make such a thing. It would have to be part of some kind of c++ feature for holding weakrefs to operations...

makslevental · 2024-05-24T21:59:00Z

We had discussed also adding some kind of arena context manager that would hold all PyOperations allocated in its scope and allow you to invalidate them (or invalidate at scope exit).

This is workable I think? It allows people to control their exposure by narrowing and widening the scope (i think?).

I think the potential for dangling pointers that are induced by the bindings itself is the worst world and we need to exit that... As we've seen, the attempt to keep a parallel accounting itself is its own enemy due to pointer reuse.

Yea I agree with this one - the worst papercuts are the ones you can't see but can feel. At the python that's stuff that happens in the cpp. At least after the excision/removal of liveOperations, if you get a crash you can bisect the python to find where/how.

In this case, unless if you can hook every way in c++ that an operation* can be invalidated or have some form of "weak opptr" semantics, there is no better way.

Yea I was thinking well shared_ptr or something but of course Operation* would have to itself be behind that and it still wouldn't work because it won't pass through the C API.

We could try some form of weak reference list to the PyOperation itself (ie. As in the arena thought experiment), but my guess is that the best use of that would be for some kind of safe mode where we could actively warn or except if mutations we know about are dangling. Without an explicit arena, though, it will be very noisy because the python GC is not precise.

How would that work? I think I vaguely understand that Py_XINCREF/Py_XDECREF is for something like that?

stellaraccident · 2024-05-24T22:06:58Z

How would that work? I think I vaguely understand that Py_XINCREF/Py_XDECREF is for something like that?

There are multiple ways to implement a weak reference system, and this would be one of them. If you had a nonce value that could be initialized for each operation, then you could provide a query mechanism to check whether the reference is still alive. That could give a safe way to detect invalidation and act accordingly.

You could use the Operation* itself as the key into such a table of weak reference objects, but the accounting would have to be done by the operation/context and tied to the Operation destructor, or else the pointer reuse could not be bounded.

So such a system would require either a nonce slot on each operation instance or additional context scoped accounting in the Operation destructor. So a memory overhead or a destruction overhead. No other thing is possible with non managed pointers.

stellaraccident · 2024-05-24T22:16:21Z

This is workable I think? It allows people to control their exposure by narrowing and widening the scope (i think?).

As long as there is a mode that lets me not do that. In the order of priorities for these bindings, the primary use case is IR construction. The secondary use case is what I've been calling "lightweight IR" traversal. A third is trying to implement some kind of compiler framework in Python, but I would argue strongly that this use case in generality is in conflict with the first two and if serious about it, we would be better off having two bindings.

The first and second use cases could be well served by some kind of safety scoping arena, as long as it wasn't required for use: there are plenty of cases where there is no bounding stack context for such things.

stellaraccident · 2024-05-24T22:56:29Z

To do this right, we need weak references to operations. And for that, we need an extra pointer in the c++ operation that can optionally point to a single weak reference object with shared ptr semantics of some kind.

There's nowhere to easily steal such a pointer in Operation, so this would be a new cost. It could probably be made optional based on a context flag to enable weak references -- with the pointer storage itself being dynamically placed in the prefix storage area used for excess op result storage when the feature is enabled.

When disabled, the cost would be a bool check on the context flag at allocation and destruction time.

Then the PyOperation would be changed to only access the MlirOperation via a new mlirOperationRefDeref() on a new MlirOperationRef object. If a reference could not be created (context not set up for weak operations) or dereference (Operation already destroyed), a python exception would be raised. Overhead would be paid completely by the user of the feature, not counting those two extra bool branches.

Would be 100% safe.

ftynse · 2024-05-27T17:09:26Z

I was thinking to have a hook in Operation::erase, with the callback itself stored in the context, that could be used to update an externalized weak reference table. When not in use, the cost is just one pointer in the context + a branch in the destructor. When in use, the cost would be higher than just having references in the operation I suppose, depending on the actual implementaiton of the table but an indirect call is also not free.

On the Python side, having a scoped arena or a generation counter in all ops (large mutations bump the counter, ops with lower counter value than the current are considered invalid) would work for me. I'd really prefer having at least some safety mechanism, can be opt in. Unlike attributes and types, which are owned by the context and have a natural scope to them, there is more possibility for operations to get out of scope and deleted. For example, I've seen op parsing + deletion after certain pieces (regions, attributes) are extracted and copied over.

stellaraccident · 2024-05-27T17:18:27Z

Maybe we should land this patch to stem your issue and aim at fixing this once and for all at the MLIR c++ level.

The issue with a single callback in the context is that there is only one and contexts can come from elsewhere.

However that could work if you built the weak reference accounting into the context. Could probably finagle it to only start doing the destructor callback if there were live references... But would need to think carefully about the threading.

(It would then also not be an indirect call but a predicated call that may be skipped)

stellaraccident · 2024-05-27T17:23:53Z

We could store the hasWeakRef bit in the operation and just have it always call a context level facility to invalidate on destruct if the bit is set. There is space to stash a bit in the operation, just not a pointer.

Advantage to that would be that it wouldn't require atomics or other things that interact badly with threading.

ftynse · 2024-05-28T11:48:31Z

Maybe we should land this patch to stem your issue and aim at fixing this once and for all at the MLIR c++ level.

We can always land and just revert just before landing the excision/update patch.

stellaraccident

LGTM as a patch on the issue. I think we need to add some form of weak references on the C++ side to make this more robust.

makslevental · 2024-05-30T05:52:04Z

I think we need to add some form of weak references on the C++ side to make this more robust.

Is there going to be opposition to this because (so far) the only use-case is supporting the Python bindings?

ftynse · 2024-05-30T08:05:56Z

Is there going to be opposition to this because (so far) the only use-case is supporting the Python bindings?

We won't know until we try... There are pure C++ use cases, e.g., all the operation maps in various rewrite drivers could use weak references, at least under some debugging flag/ifdef. Having a hook in Operation::erase would have saved countless hours of debugging rewrite patterns that bypass `RewriterBase::eraseOp leading to non-asan'able errors.

stellaraccident · 2024-05-30T16:00:21Z

I think I could write a convincing POC and RFC. I'm on a deadline right now though and don't have the time for a diversion just this minute. Open to someone else giving it a try, but I think it is the right thing to do and will give it a try when I have the time.

makslevental · 2024-05-30T22:08:06Z

I think I could write a convincing POC and RFC. I'm on a deadline right now though and don't have the time for a diversion just this minute. Open to someone else giving it a try, but I think it is the right thing to do and will give it a try when I have the time.

I would like to give this one a try but it's a little extra-curricular at the moment :) I think~2 weeks at most until I'll have room. Can it hold until then?

stellaraccident · 2024-05-31T02:55:22Z

It is almost a certainty I won't have attention to spare on this in the next two weeks. If I do, it will be too doodle something as a distraction to what I should be working on :) if I do that, I'll share anything I manage

ftynse requested review from makslevental and stellaraccident as code owners May 24, 2024 19:56

llvmbot added the mlir label May 24, 2024

ftynse mentioned this pull request May 24, 2024

[mlir][python] Tracking of PyOperations references in liveOperations map breaks down in reasonable scenarios #92344

Open

stellaraccident approved these changes May 30, 2024

View reviewed changes

ftynse merged commit 67897d7 into llvm:main May 30, 2024

ftynse deleted the live-operations branch May 30, 2024 08:06

makslevental mentioned this pull request Nov 14, 2024

[Python] Develop python bindings for Presburger library #113233

Draft

makslevental mentioned this pull request May 13, 2025

[MLIR] Existing Python bindings issues #139785

Open

[mlir][py] invalidate nested operations when parent is deleted #93339

[mlir][py] invalidate nested operations when parent is deleted #93339

Uh oh!

Conversation

ftynse commented May 24, 2024 • edited by tyb0807 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented May 24, 2024

Uh oh!

stellaraccident commented May 24, 2024

Uh oh!

makslevental commented May 24, 2024

Uh oh!

ftynse commented May 24, 2024

Uh oh!

stellaraccident commented May 24, 2024

Uh oh!

makslevental commented May 24, 2024

Uh oh!

stellaraccident commented May 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stellaraccident commented May 24, 2024

Uh oh!

makslevental commented May 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stellaraccident commented May 24, 2024

Uh oh!

stellaraccident commented May 24, 2024

Uh oh!

stellaraccident commented May 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ftynse commented May 27, 2024

Uh oh!

stellaraccident commented May 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stellaraccident commented May 27, 2024

Uh oh!

ftynse commented May 28, 2024

Uh oh!

stellaraccident left a comment

Choose a reason for hiding this comment

Uh oh!

makslevental commented May 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ftynse commented May 30, 2024

Uh oh!

stellaraccident commented May 30, 2024

Uh oh!

makslevental commented May 30, 2024

Uh oh!

stellaraccident commented May 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ftynse commented May 24, 2024 •

edited by tyb0807

Loading

stellaraccident commented May 24, 2024 •

edited

Loading

makslevental commented May 24, 2024 •

edited

Loading

stellaraccident commented May 24, 2024 •

edited

Loading

stellaraccident commented May 27, 2024 •

edited

Loading

makslevental commented May 30, 2024 •

edited

Loading

stellaraccident commented May 31, 2024 •

edited

Loading