[mlir][python] Tracking of live operations fails. #69730

ingomueller-net · 2023-10-20T14:10:04Z

I am running into a problem with the Python bindings of MLIR related to how PyOperations track the underlying Operations.

If I understand things right, when a PyOperation is created, the pointer to the underlying Operation is tracked in PyMlirContext::liveOperations (e.g., here). This is used (potentially among other things) to return the same PyOperation for existing Operations, e.g., here.

However, it is possible to obtain a PyOperation to an Operation that is then destroyed without the Python binding noticing, for example, by running a pass that deletes that Operation. At that point, PyMlirContext::liveOperations has an entry for an Operation that doesn't exist anymore. It is then possible that that memory location is used for a new Operation that is created through PyOperation::createDetached (for example, through PyOperation::clone, which is called by the Python-bound method _OperationBase.clone). PyMlirContext::liveOperations thus contains the pointer of the previous operation that has been destroyed in the meantime and PyOperation::createDetached wants to add that same pointer again for a (supposedly not-yet-existing) "new" Operation. Accordingly, the assertion in PyOperation::createDetached fails.

I am unsure what the best way to fix this is. After a day of thinking about the problem, my best guess is to say that such a tracking is, in fact, not possible. My second best guess is to say that it is illegal to hold PyOperation to ops that could be modified by some other functions but (1) that seems quite error-prone and (2) I haven't been able to find out how to release the PyOperation that is causing the problem in my case. Before I spend more time to think about this topic, let me ask for input from the people that are more familiar with that part of the code: @stellaraccident, @teqdruid, @mikeurbach, @ftynse.

The text was updated successfully, but these errors were encountered:

llvmbot · 2023-10-20T14:51:17Z

@llvm/issue-subscribers-mlir-python

Author: Ingo Müller (ingomueller-net)

I am running into a problem with the Python bindings of MLIR related to how `PyOperation`s track the underlying `Operation`s.

If I understand things right, when a PyOperation is created, the pointer to the underlying Operation is tracked in PyMlirContext::liveOperations (e.g., here). This is used (potentially among other things) to return the same PyOperation for existing Operations, e.g., here.

However, it is possible to obtain a PyOperation to an Operation that is then destroyed without the Python binding noticing, for example, by running a pass that deletes that Operation. At that point, PyMlirContext::liveOperations has an entry for an Operation that doesn't exist anymore. It is then possible that that memory location is used for a new Operation that is created through PyOperation::createDetached (for example, through PyOperation::clone, which is called by the Python-bound method _OperationBase.clone). PyMlirContext::liveOperations thus contains the pointer of the previous operation that has been destroyed in the meantime and PyOperation::createDetached wants to add that same pointer again for a (supposedly not-yet-existing) "new" Operation. Accordingly, the assertion in PyOperation::createDetached fails.

I am unsure what the best way to fix this is. After a day of thinking about the problem, my best guess is to say that such a tracking is, in fact, not possible. My second best guess is to say that it is illegal to hold PyOperation to ops that could be modified by some other functions but (1) that seems quite error-prone and (2) I haven't been able to find out how to release the PyOperation that is causing the problem in my case. Before I spend more time to think about this topic, let me ask for input from the people that are more familiar with that part of the code: @stellaraccident, @teqdruid, @mikeurbach, @ftynse.

mikeurbach · 2023-10-20T15:14:13Z

I think your analysis is spot on. If I am remembering correctly, @teqdruid and I hit exactly this issue a while ago, and nothing has changed in that regard. He is the real expert here, but this is something that we wanted to bring up in the MLIR workshop last week. I'll try to convey the solution in PyCDE, but maybe John can chime in.

The goal was to support a workflow like this, without running into the issue you're describing:

create some operations via Python bindings
run some passes which mutate the IR
traverse the IR / create some new operations via Python bindings

For this, PyCDE ends up maintaining its own in-memory cache, which is busted whenever we handoff to MLIR passes that may mutate the IR: https://github.com/llvm/circt/blob/2f8381cd1bdf4411e5354038258a17bd717b7130/frontends/PyCDE/src/system.py#L356

stellaraccident · 2023-10-20T17:07:50Z

There were some recent (last six months or so) patches that tried to use the "invalidation" support in ops to do bulk invalidation (which servers the python op object from the underlying c++). But I don't know the status. I believe the thinking was that more here would fill more gaps. But all of our use cases are "builder style" so I haven't spent much time recently on it.

Maybe @makslevental

makslevental · 2023-10-20T17:50:14Z

Caught red-handed; I did indeed implement "invalidation" here https://reviews.llvm.org/D155543 but didn't get it across the finish line. Was hoping I would be able to take care of it before anyone noticed but things keep getting "piled higher and deeper". Let me migrate that patch to a PR and we can discuss.

ftynse · 2023-10-20T18:25:41Z

+1 for invalidation. I don't think we can realistically track all potentially ways to delete operations in C++ from Python (I'd love to have that in C++ to start with, so much less pain debugging). Our model in bindings originally was "construct IR, then run passes". If handles to IR objects are needed after applying the pass manager, they must be obtained by re-traversing the transformed IR. We could consider some sort of scoped invalidation where only the handles to operations nested under the pass manager root get invalidated.

Interestingly enough, this is a special case of the infamous "handle invalidation" problem that we've hit repeatedly in the transform dialect, with and without Python. We are now able to track through a subset of op replacements using pattern rewriter listeners, but this does not generalize to non pattern-based passes so the safer assumption would be to just invalidate everything under the pass root.

cc @martin-luecke

teqdruid · 2023-10-20T21:06:16Z

I think your analysis is spot on. If I am remembering correctly, @teqdruid and I hit exactly this issue a while ago, and nothing has changed in that regard. He is the real expert here, but this is something that we wanted to bring up in the MLIR workshop last week. I'll try to convey the solution in PyCDE, but maybe John can chime in

Yeah, I've run into this a number of times. It's extremely confusing to be debugging a Python class instance for op A only to learn that the MlirOperation is referring to a new op which is usually not even the same operation type! TBF, pointer invalidation is an issue in the C++ world as well.

Mike is essentially correct about the PyCDE solution. We try our best not to expose raw MLIR Python class instances to the user, but Python is Python so there's only so much one can do to hide things. IIRC I did add a call to MLIR to force the entire C++ (Pybind) cache to get invalidated which (IIRC) notifies the Python class. PyCDE invalidates its own cache (by which point the liveOperations cache should be empty), then for safety checks that it is indeed cleared, issues a warning if it isn't, then forces a cache invalidation. My memory is hazy since I did all this 1+ years ago. (I can barely remember what I had for breakfast.)

Flushing the entire cache is a bit heavy-handed. I once proposed that the Operation deconstructor should support calling a registered hook to inform anything that needs to know when an Operation has been erased. This method would work regardless of how said operation was deleted. IIRC the primary objection was performance-related -- adding a pointer null check is too much overhead in the common case wherein one wasn't using the callback. If callback was per-operation, that pointer addition to the Operation class would bloat the Operation memory. (I think it's more desirable to register it in the MLIRContext though.)

Fixes #69730 (also see https://reviews.llvm.org/D155543). There are two things outstanding (why I didn't land before): 1. add some C API tests for `mlirOperationWalk`; 2. potentially refactor how the invalidation in `run` works; the first version of the code looked like this: ```cpp if (invalidateOps) { auto *context = op.getOperation().getContext().get(); MlirOperationWalkCallback invalidatingCallback = [](MlirOperation op, void *userData) { PyMlirContext *context = static_cast<PyMlirContext *>(userData); context->setOperationInvalid(op); }; auto numRegions = mlirOperationGetNumRegions(op.getOperation().get()); for (int i = 0; i < numRegions; ++i) { MlirRegion region = mlirOperationGetRegion(op.getOperation().get(), i); for (MlirBlock block = mlirRegionGetFirstBlock(region); !mlirBlockIsNull(block); block = mlirBlockGetNextInRegion(block)) for (MlirOperation childOp = mlirBlockGetFirstOperation(block); !mlirOperationIsNull(childOp); childOp = mlirOperationGetNextInBlock(childOp)) mlirOperationWalk(childOp, invalidatingCallback, context, MlirWalkPostOrder); } } ``` This is verbose and ugly but it has the important benefit of not executing `mlirOperationEqual(rootOp->get(), op)` for every op underneath the root op. Supposing there's no desire for the slightly more efficient but highly convoluted approach, I can land this "posthaste". But, since we have eyes on this now, any suggestions or approaches (or needs/concerns) are welcome.

`PyOperations` are Python-level handles to `Operation *` instances. When the latter are modified by C++, the former need to be invalidated. llvm#69746 implements such invalidation mechanism by setting all `PyReferences` to `invalid`. However, that is not enough: they also need to be removed from the `liveOperations` map since other parts of the code (such as `PyOperation::createDetached`) assume that that map only contains valid refs. This is required to actually solve the issue in llvm#69730.

`PyOperations` are Python-level handles to `Operation *` instances. When the latter are modified by C++, the former need to be invalidated. #69746 implements such invalidation mechanism by setting all `PyReferences` to `invalid`. However, that is not enough: they also need to be removed from the `liveOperations` map since other parts of the code (such as `PyOperation::createDetached`) assume that that map only contains valid refs. This is required to actually solve the issue in #69730.

github-actions bot added the mlir label Oct 20, 2023

EugeneZelenko added mlir:python MLIR Python bindings and removed mlir labels Oct 20, 2023

makslevental self-assigned this Oct 20, 2023

makslevental mentioned this issue Oct 20, 2023

[MLIR][python bindings] invalidate ops after PassManager run #69746

Merged

makslevental closed this as completed in #69746 Oct 21, 2023

ingomueller-net mentioned this issue Oct 24, 2023

[mlir][python] Clear PyOperations instead of invalidating them. #70044

Merged

makslevental mentioned this issue Nov 14, 2024

[Python] Develop python bindings for Presburger library #113233

Draft

makslevental mentioned this issue May 13, 2025

[MLIR] Existing Python bindings issues #139785

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[mlir][python] Tracking of live operations fails. #69730

[mlir][python] Tracking of live operations fails. #69730

ingomueller-net commented Oct 20, 2023

llvmbot commented Oct 20, 2023

mikeurbach commented Oct 20, 2023

stellaraccident commented Oct 20, 2023

makslevental commented Oct 20, 2023

ftynse commented Oct 20, 2023

teqdruid commented Oct 20, 2023

[mlir][python] Tracking of live operations fails. #69730

[mlir][python] Tracking of live operations fails. #69730

Comments

ingomueller-net commented Oct 20, 2023

llvmbot commented Oct 20, 2023

mikeurbach commented Oct 20, 2023

stellaraccident commented Oct 20, 2023

makslevental commented Oct 20, 2023

ftynse commented Oct 20, 2023

teqdruid commented Oct 20, 2023