Skip to content

[mlir][python] Tracking of live operations fails. #69730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ingomueller-net opened this issue Oct 20, 2023 · 6 comments · Fixed by #69746
Closed

[mlir][python] Tracking of live operations fails. #69730

ingomueller-net opened this issue Oct 20, 2023 · 6 comments · Fixed by #69746
Assignees
Labels
mlir:python MLIR Python bindings

Comments

@ingomueller-net
Copy link
Contributor

I am running into a problem with the Python bindings of MLIR related to how PyOperations track the underlying Operations.

If I understand things right, when a PyOperation is created, the pointer to the underlying Operation is tracked in PyMlirContext::liveOperations (e.g., here). This is used (potentially among other things) to return the same PyOperation for existing Operations, e.g., here.

However, it is possible to obtain a PyOperation to an Operation that is then destroyed without the Python binding noticing, for example, by running a pass that deletes that Operation. At that point, PyMlirContext::liveOperations has an entry for an Operation that doesn't exist anymore. It is then possible that that memory location is used for a new Operation that is created through PyOperation::createDetached (for example, through PyOperation::clone, which is called by the Python-bound method _OperationBase.clone). PyMlirContext::liveOperations thus contains the pointer of the previous operation that has been destroyed in the meantime and PyOperation::createDetached wants to add that same pointer again for a (supposedly not-yet-existing) "new" Operation. Accordingly, the assertion in PyOperation::createDetached fails.

I am unsure what the best way to fix this is. After a day of thinking about the problem, my best guess is to say that such a tracking is, in fact, not possible. My second best guess is to say that it is illegal to hold PyOperation to ops that could be modified by some other functions but (1) that seems quite error-prone and (2) I haven't been able to find out how to release the PyOperation that is causing the problem in my case. Before I spend more time to think about this topic, let me ask for input from the people that are more familiar with that part of the code: @stellaraccident, @teqdruid, @mikeurbach, @ftynse.

@github-actions github-actions bot added the mlir label Oct 20, 2023
@EugeneZelenko EugeneZelenko added mlir:python MLIR Python bindings and removed mlir labels Oct 20, 2023
@llvmbot
Copy link
Member

llvmbot commented Oct 20, 2023

@llvm/issue-subscribers-mlir-python

Author: Ingo Müller (ingomueller-net)

I am running into a problem with the Python bindings of MLIR related to how `PyOperation`s track the underlying `Operation`s.

If I understand things right, when a PyOperation is created, the pointer to the underlying Operation is tracked in PyMlirContext::liveOperations (e.g., here). This is used (potentially among other things) to return the same PyOperation for existing Operations, e.g., here.

However, it is possible to obtain a PyOperation to an Operation that is then destroyed without the Python binding noticing, for example, by running a pass that deletes that Operation. At that point, PyMlirContext::liveOperations has an entry for an Operation that doesn't exist anymore. It is then possible that that memory location is used for a new Operation that is created through PyOperation::createDetached (for example, through PyOperation::clone, which is called by the Python-bound method _OperationBase.clone). PyMlirContext::liveOperations thus contains the pointer of the previous operation that has been destroyed in the meantime and PyOperation::createDetached wants to add that same pointer again for a (supposedly not-yet-existing) "new" Operation. Accordingly, the assertion in PyOperation::createDetached fails.

I am unsure what the best way to fix this is. After a day of thinking about the problem, my best guess is to say that such a tracking is, in fact, not possible. My second best guess is to say that it is illegal to hold PyOperation to ops that could be modified by some other functions but (1) that seems quite error-prone and (2) I haven't been able to find out how to release the PyOperation that is causing the problem in my case. Before I spend more time to think about this topic, let me ask for input from the people that are more familiar with that part of the code: @stellaraccident, @teqdruid, @mikeurbach, @ftynse.

@mikeurbach
Copy link
Contributor

I think your analysis is spot on. If I am remembering correctly, @teqdruid and I hit exactly this issue a while ago, and nothing has changed in that regard. He is the real expert here, but this is something that we wanted to bring up in the MLIR workshop last week. I'll try to convey the solution in PyCDE, but maybe John can chime in.

The goal was to support a workflow like this, without running into the issue you're describing:

  • create some operations via Python bindings
  • run some passes which mutate the IR
  • traverse the IR / create some new operations via Python bindings

For this, PyCDE ends up maintaining its own in-memory cache, which is busted whenever we handoff to MLIR passes that may mutate the IR: https://github.com/llvm/circt/blob/2f8381cd1bdf4411e5354038258a17bd717b7130/frontends/PyCDE/src/system.py#L356

@stellaraccident
Copy link
Contributor

There were some recent (last six months or so) patches that tried to use the "invalidation" support in ops to do bulk invalidation (which servers the python op object from the underlying c++). But I don't know the status. I believe the thinking was that more here would fill more gaps. But all of our use cases are "builder style" so I haven't spent much time recently on it.

Maybe @makslevental

@makslevental makslevental self-assigned this Oct 20, 2023
@makslevental
Copy link
Contributor

Caught red-handed; I did indeed implement "invalidation" here https://reviews.llvm.org/D155543 but didn't get it across the finish line. Was hoping I would be able to take care of it before anyone noticed but things keep getting "piled higher and deeper". Let me migrate that patch to a PR and we can discuss.

@ftynse
Copy link
Member

ftynse commented Oct 20, 2023

+1 for invalidation. I don't think we can realistically track all potentially ways to delete operations in C++ from Python (I'd love to have that in C++ to start with, so much less pain debugging). Our model in bindings originally was "construct IR, then run passes". If handles to IR objects are needed after applying the pass manager, they must be obtained by re-traversing the transformed IR. We could consider some sort of scoped invalidation where only the handles to operations nested under the pass manager root get invalidated.

Interestingly enough, this is a special case of the infamous "handle invalidation" problem that we've hit repeatedly in the transform dialect, with and without Python. We are now able to track through a subset of op replacements using pattern rewriter listeners, but this does not generalize to non pattern-based passes so the safer assumption would be to just invalidate everything under the pass root.

cc @martin-luecke

@teqdruid
Copy link
Contributor

I think your analysis is spot on. If I am remembering correctly, @teqdruid and I hit exactly this issue a while ago, and nothing has changed in that regard. He is the real expert here, but this is something that we wanted to bring up in the MLIR workshop last week. I'll try to convey the solution in PyCDE, but maybe John can chime in

Yeah, I've run into this a number of times. It's extremely confusing to be debugging a Python class instance for op A only to learn that the MlirOperation is referring to a new op which is usually not even the same operation type! TBF, pointer invalidation is an issue in the C++ world as well.

Mike is essentially correct about the PyCDE solution. We try our best not to expose raw MLIR Python class instances to the user, but Python is Python so there's only so much one can do to hide things. IIRC I did add a call to MLIR to force the entire C++ (Pybind) cache to get invalidated which (IIRC) notifies the Python class. PyCDE invalidates its own cache (by which point the liveOperations cache should be empty), then for safety checks that it is indeed cleared, issues a warning if it isn't, then forces a cache invalidation. My memory is hazy since I did all this 1+ years ago. (I can barely remember what I had for breakfast.)

Flushing the entire cache is a bit heavy-handed. I once proposed that the Operation deconstructor should support calling a registered hook to inform anything that needs to know when an Operation has been erased. This method would work regardless of how said operation was deleted. IIRC the primary objection was performance-related -- adding a pointer null check is too much overhead in the common case wherein one wasn't using the callback. If callback was per-operation, that pointer addition to the Operation class would bloat the Operation memory. (I think it's more desirable to register it in the MLIRContext though.)

makslevental added a commit that referenced this issue Oct 21, 2023
Fixes #69730 (also see
https://reviews.llvm.org/D155543).

There are  two things outstanding (why I didn't land before):

1. add some C API tests for `mlirOperationWalk`;
2. potentially refactor how the invalidation in `run` works; the first
version of the code looked like this:
    ```cpp
    if (invalidateOps) {
      auto *context = op.getOperation().getContext().get();
      MlirOperationWalkCallback invalidatingCallback =
          [](MlirOperation op, void *userData) {
            PyMlirContext *context =
                static_cast<PyMlirContext *>(userData);
            context->setOperationInvalid(op);
          };
      auto numRegions =
          mlirOperationGetNumRegions(op.getOperation().get());
      for (int i = 0; i < numRegions; ++i) {
        MlirRegion region =
            mlirOperationGetRegion(op.getOperation().get(), i);
        for (MlirBlock block = mlirRegionGetFirstBlock(region);
             !mlirBlockIsNull(block);
             block = mlirBlockGetNextInRegion(block))
          for (MlirOperation childOp =
                   mlirBlockGetFirstOperation(block);
               !mlirOperationIsNull(childOp);
               childOp = mlirOperationGetNextInBlock(childOp))
            mlirOperationWalk(childOp, invalidatingCallback, context,
                              MlirWalkPostOrder);
      }
    }
    ```
This is verbose and ugly but it has the important benefit of not
executing `mlirOperationEqual(rootOp->get(), op)` for every op
underneath the root op.

Supposing there's no desire for the slightly more efficient but highly
convoluted approach, I can land this "posthaste".
But, since we have eyes on this now, any suggestions or approaches (or
needs/concerns) are welcome.
ingomueller-net added a commit to ingomueller-net/llvm-project that referenced this issue Oct 24, 2023
`PyOperations` are Python-level handles to `Operation *` instances. When
the latter are modified by C++, the former need to be invalidated.
 llvm#69746 implements such invalidation mechanism by setting all
`PyReferences` to `invalid`. However, that is not enough: they also need
to be removed from the `liveOperations` map since other parts of the
code (such as `PyOperation::createDetached`) assume that that map only
contains valid refs.

This is required to actually solve the issue in llvm#69730.
ingomueller-net added a commit that referenced this issue Oct 25, 2023
`PyOperations` are Python-level handles to `Operation *` instances. When
the latter are modified by C++, the former need to be invalidated.
#69746 implements such invalidation mechanism by setting all
`PyReferences` to `invalid`. However, that is not enough: they also need
to be removed from the `liveOperations` map since other parts of the
code (such as `PyOperation::createDetached`) assume that that map only
contains valid refs.

This is required to actually solve the issue in #69730.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mlir:python MLIR Python bindings
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants