-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[mlir][py] invalidate nested operations when parent is deleted #93339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When an operation is erased in Python, its children may still be in the "live" list inside Python bindings. After this, if some of the newly allocated operations happen to reuse the same pointer address, this will trigger an assertion in the bindings. This assertion would be incorrect because the operations aren't actually live. Make sure we remove the children operations form the "live" list when erasing the parent. This also concentrates responsibility over the removal from the "live" list and invalidation in a single place. Note that this requires the IR to be sufficiently structurally valid so a walk through it can succeeed. If this invariant was broken by, e.g, C++ pass called from Python, there isn't much we can do.
@llvm/pr-subscribers-mlir Author: Oleksandr "Alex" Zinenko (ftynse) ChangesWhen an operation is erased in Python, its children may still be in the "live" list inside Python bindings. After this, if some of the newly allocated operations happen to reuse the same pointer address, this will trigger an assertion in the bindings. This assertion would be incorrect because the operations aren't actually live. Make sure we remove the children operations form the "live" list when erasing the parent. This also concentrates responsibility over the removal from the "live" list and invalidation in a single place. Note that this requires the IR to be sufficiently structurally valid so a walk through it can succeeed. If this invariant was broken by, e.g, C++ pass called from Python, there isn't much we can do. Full diff: https://github.com/llvm/llvm-project/pull/93339.diff 3 Files Affected:
diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 01678a9719f90..f03c540d618cd 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -684,6 +684,17 @@ void PyMlirContext::clearOperationsInside(MlirOperation op) {
clearOperationsInside(opRef->getOperation());
}
+void PyMlirContext::clearOperationAndInside(PyOperationBase &op) {
+ MlirOperationWalkCallback invalidatingCallback = [](MlirOperation op,
+ void *userData) {
+ PyMlirContextRef &contextRef = *static_cast<PyMlirContextRef *>(userData);
+ contextRef->clearOperation(op);
+ return MlirWalkResult::MlirWalkResultAdvance;
+ };
+ mlirOperationWalk(op.getOperation(), invalidatingCallback,
+ &op.getOperation().getContext(), MlirWalkPreOrder);
+}
+
size_t PyMlirContext::getLiveModuleCount() { return liveModules.size(); }
pybind11::object PyMlirContext::contextEnter() {
@@ -1112,12 +1123,16 @@ PyOperation::~PyOperation() {
// If the operation has already been invalidated there is nothing to do.
if (!valid)
return;
- auto &liveOperations = getContext()->liveOperations;
- assert(liveOperations.count(operation.ptr) == 1 &&
- "destroying operation not in live map");
- liveOperations.erase(operation.ptr);
- if (!isAttached()) {
- mlirOperationDestroy(operation);
+
+ // Otherwise, invalidate the operation and remove it from live map when it is
+ // attached.
+ if (isAttached()) {
+ getContext()->clearOperation(*this);
+ } else {
+ // And destroy it when it is detached, i.e. owned by Python, in which case
+ // all nested operations must be invalidated at removed from the live map as
+ // well.
+ erase();
}
}
@@ -1527,14 +1542,8 @@ py::object PyOperation::createOpView() {
void PyOperation::erase() {
checkValid();
- // TODO: Fix memory hazards when erasing a tree of operations for which a deep
- // Python reference to a child operation is live. All children should also
- // have their `valid` bit set to false.
- auto &liveOperations = getContext()->liveOperations;
- if (liveOperations.count(operation.ptr))
- liveOperations.erase(operation.ptr);
+ getContext()->clearOperationAndInside(*this);
mlirOperationDestroy(operation);
- valid = false;
}
//------------------------------------------------------------------------------
diff --git a/mlir/lib/Bindings/Python/IRModule.h b/mlir/lib/Bindings/Python/IRModule.h
index b038a0c54d29b..8c34c11f70950 100644
--- a/mlir/lib/Bindings/Python/IRModule.h
+++ b/mlir/lib/Bindings/Python/IRModule.h
@@ -218,6 +218,8 @@ class PyMlirContext {
/// This is useful for when some non-bindings code destroys the operation and
/// the bindings need to made aware. For example, in the case when pass
/// manager is run.
+ ///
+ /// Note that this does *NOT* clear the nested operations.
void clearOperation(MlirOperation op);
/// Clears all operations nested inside the given op using
@@ -225,6 +227,10 @@ class PyMlirContext {
void clearOperationsInside(PyOperationBase &op);
void clearOperationsInside(MlirOperation op);
+ /// Clears the operaiton _and_ all operations inside using
+ /// `clearOperation(MlirOperation)`.
+ void clearOperationAndInside(PyOperationBase &op);
+
/// Gets the count of live modules associated with this context.
/// Used for testing.
size_t getLiveModuleCount();
@@ -246,6 +252,7 @@ class PyMlirContext {
private:
PyMlirContext(MlirContext context);
+
// Interns the mapping of live MlirContext::ptr to PyMlirContext instances,
// preserving the relationship that an MlirContext maps to a single
// PyMlirContext wrapper. This could be replaced in the future with an
diff --git a/mlir/test/python/live_operations.py b/mlir/test/python/live_operations.py
new file mode 100644
index 0000000000000..892ed1715f6c7
--- /dev/null
+++ b/mlir/test/python/live_operations.py
@@ -0,0 +1,46 @@
+# RUN: %PYTHON %s
+# It is sufficient that this doesn't assert.
+
+from mlir.ir import *
+
+
+def createDetachedModule():
+ module = Module.create()
+ with InsertionPoint(module.body):
+ # TODO: Python bindings are currently unaware that modules are also
+ # operations, so having a module erased won't trigger the cascading
+ # removal of live operations (#93337). Use a non-module operation
+ # instead.
+ nested = Operation.create("test.some_operation", regions=1)
+
+ # When the operation is detached from parent, it is considered to be
+ # owned by Python. It will therefore be erased when the Python object
+ # is destroyed.
+ nested.detach_from_parent()
+
+ # However, we create and maintain references to operations within
+ # `nested`. These references keep the corresponding operations in the
+ # "live" list even if they have been erased in C++, making them
+ # "zombie". If the C++ allocator reuses one of the address previously
+ # used for a now-"zombie" operation, this used to result in an
+ # assertion "cannot create detached operation that already exists" from
+ # the bindings code. Erasing the detached operation should result in
+ # removing all nested operations from the live list.
+ #
+ # Note that the assertion is not guaranteed since it depends on the
+ # behavior of the allocator on the C++ side, so this test mail fail
+ # intermittently.
+ with InsertionPoint(nested.regions[0].blocks.append()):
+ a = [Operation.create("test.some_other_operation") for i in range(100)]
+ return a
+
+
+def createManyDetachedModules():
+ with Context() as ctx, Location.unknown():
+ ctx.allow_unregistered_dialects = True
+ for j in range(100):
+ a = createDetachedModule()
+
+
+if __name__ == "__main__":
+ createManyDetachedModules()
|
How do you want to do this? Land this or see if we can get the live operations excision patch landed? |
The patch in question #92631 |
I haven't entirely followed the other discussion. What's the plan there, is there consensus to remove the live operations mechanism entirely? I'm not sure whether going from "access to IR objects won't crash (best effort)" to "access to IR object will almost always crash" is a UX improvement. |
It makes them more like attributes and types where we return a python wrapper backed by an operation* that has a lifetime for however long you hold the wrapper. It is up to the user to not delete IR structure that have live references in their code, and there is no secret map of dangling pointers. We had discussed also adding some kind of arena context manager that would hold all PyOperations allocated in its scope and allow you to invalidate them (or invalidate at scope exit). There really isn't a better way in the context of what MLIR is. If we went back to something super heavy handed, that would probably push me to maintain out of tree bindings optimized for efficiency of IR construction and constrained traversal, similar to the lite LLVM python bindings that are used for production cases where efficiency matters. I was going to patch the excision patch into some projects and see how bad it is. I think the potential for dangling pointers that are induced by the bindings itself is the worst world and we need to exit that. |
Hey I've been told multiple times that the bindings should have no more QoL features than the C++ codebase itself 😛. I kid. It's true and I don't see a way out there - I was hoping Stella would come back with some ultra-wise/clever solution. |
In this case, unless if you can hook every way in c++ that an operation* can be invalidated or have some form of "weak opptr" semantics, there is no better way. As we've seen, the attempt to keep a parallel accounting itself is its own enemy due to pointer reuse. We could try some form of weak reference list to the PyOperation itself (ie. As in the arena thought experiment), but my guess is that the best use of that would be for some kind of safe mode where we could actively warn or except if mutations we know about are dangling. Without an explicit arena, though, it will be very noisy because the python GC is not precise. |
If there was an extra int slot on every operation, you could manage something by assigning a non reusable identity value to any operation with outside bindings. But there is not such a slot, and I don't think we would just make such a thing. It would have to be part of some kind of c++ feature for holding weakrefs to operations... |
This is workable I think? It allows people to control their exposure by narrowing and widening the scope (i think?).
Yea I agree with this one - the worst papercuts are the ones you can't see but can feel. At the python that's stuff that happens in the cpp. At least after the excision/removal of
Yea I was thinking well
How would that work? I think I vaguely understand that |
There are multiple ways to implement a weak reference system, and this would be one of them. If you had a nonce value that could be initialized for each operation, then you could provide a query mechanism to check whether the reference is still alive. That could give a safe way to detect invalidation and act accordingly. You could use the Operation* itself as the key into such a table of weak reference objects, but the accounting would have to be done by the operation/context and tied to the Operation destructor, or else the pointer reuse could not be bounded. So such a system would require either a nonce slot on each operation instance or additional context scoped accounting in the Operation destructor. So a memory overhead or a destruction overhead. No other thing is possible with non managed pointers. |
As long as there is a mode that lets me not do that. In the order of priorities for these bindings, the primary use case is IR construction. The secondary use case is what I've been calling "lightweight IR" traversal. A third is trying to implement some kind of compiler framework in Python, but I would argue strongly that this use case in generality is in conflict with the first two and if serious about it, we would be better off having two bindings. The first and second use cases could be well served by some kind of safety scoping arena, as long as it wasn't required for use: there are plenty of cases where there is no bounding stack context for such things. |
To do this right, we need weak references to operations. And for that, we need an extra pointer in the c++ operation that can optionally point to a single weak reference object with shared ptr semantics of some kind. There's nowhere to easily steal such a pointer in Operation, so this would be a new cost. It could probably be made optional based on a context flag to enable weak references -- with the pointer storage itself being dynamically placed in the prefix storage area used for excess op result storage when the feature is enabled. When disabled, the cost would be a bool check on the context flag at allocation and destruction time. Then the PyOperation would be changed to only access the MlirOperation via a new mlirOperationRefDeref() on a new MlirOperationRef object. If a reference could not be created (context not set up for weak operations) or dereference (Operation already destroyed), a python exception would be raised. Overhead would be paid completely by the user of the feature, not counting those two extra bool branches. Would be 100% safe. |
I was thinking to have a hook in On the Python side, having a scoped arena or a generation counter in all ops (large mutations bump the counter, ops with lower counter value than the current are considered invalid) would work for me. I'd really prefer having at least some safety mechanism, can be opt in. Unlike attributes and types, which are owned by the context and have a natural scope to them, there is more possibility for operations to get out of scope and deleted. For example, I've seen op parsing + deletion after certain pieces (regions, attributes) are extracted and copied over. |
Maybe we should land this patch to stem your issue and aim at fixing this once and for all at the MLIR c++ level. The issue with a single callback in the context is that there is only one and contexts can come from elsewhere. However that could work if you built the weak reference accounting into the context. Could probably finagle it to only start doing the destructor callback if there were live references... But would need to think carefully about the threading. (It would then also not be an indirect call but a predicated call that may be skipped) |
We could store the hasWeakRef bit in the operation and just have it always call a context level facility to invalidate on destruct if the bit is set. There is space to stash a bit in the operation, just not a pointer. Advantage to that would be that it wouldn't require atomics or other things that interact badly with threading. |
We can always land and just revert just before landing the excision/update patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as a patch on the issue. I think we need to add some form of weak references on the C++ side to make this more robust.
Is there going to be opposition to this because (so far) the only use-case is supporting the Python bindings? |
We won't know until we try... There are pure C++ use cases, e.g., all the operation maps in various rewrite drivers could use weak references, at least under some debugging flag/ifdef. Having a hook in |
I think I could write a convincing POC and RFC. I'm on a deadline right now though and don't have the time for a diversion just this minute. Open to someone else giving it a try, but I think it is the right thing to do and will give it a try when I have the time. |
I would like to give this one a try but it's a little extra-curricular at the moment :) I think~2 weeks at most until I'll have room. Can it hold until then? |
It is almost a certainty I won't have attention to spare on this in the next two weeks. If I do, it will be too doodle something as a distraction to what I should be working on :) if I do that, I'll share anything I manage |
When an operation is erased in Python, its children may still be in the "live" list inside Python bindings. After this, if some of the newly allocated operations happen to reuse the same pointer address, this will trigger an assertion in the bindings. This assertion would be incorrect because the operations aren't actually live. Make sure we remove the children operations from the "live" list when erasing the parent.
This also concentrates responsibility over the removal from the "live" list and invalidation in a single place.
Note that this requires the IR to be sufficiently structurally valid so a walk through it can succeed. If this invariant was broken by, e.g, C++ pass called from Python, there isn't much we can do.