Plugin: Use mypy to enrich AST with types #12513

Open
jepperaskdk opened this issue Apr 3, 2022 · 4 comments
Labels: feature, topic-plugins (The plugin API and ideas for new plugins)

Comments

@jepperaskdk

With the inspect/ast modules I can get an AST of e.g. a function and inspect variables - but in terms of types, I can only get the annotations from the signature (AFAIK).

Can mypy enrich the rest of the tree with types? And preferably expose this so that plugins can be written for it?

Does mypy already build a typed AST under the hood? Can you point me to where it happens?

Thank you

JelleZijlstra added the topic-plugins label Apr 3, 2022
@devmessias
Contributor

Seems related to #4868
Hi @jepperaskdk. Regarding your questions:

  • "Can mypy enrich the rest of the tree with types?"

I think so. However, I don't know yet if this is trivial. I'm trying to figure out how to do it.

  • "And preferably expose this so that plugins can be written for it?"

There is no documentation on how to do it. But I believe this is not a good approach.

  • "Does mypy already build a typed AST under the hood?"

Mypy builds a tree with type information, but I don't know all the details.

  • "Can you point me to where it happens?"

I don't know all the details. You should look into mypy/build.py. It seems that the mypy AST can be obtained with:

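```python
import mypy.build as BUILD
import mypy.main as MAIN

py_file = "src/mod/test"

# Dotted module name; assumes src/ and src/mod/ are packages
mod = py_file.replace("/", ".")

# process_options returns the build sources and the parsed options
files, opt = MAIN.process_options([f"{py_file}.py"])
opt.preserve_asts = True  # keep the trees after type checking
opt.fine_grained_incremental = True
result = BUILD.build(files, options=opt)
print(str(result.graph[mod].tree))
```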

@devmessias
Contributor

@jepperaskdk I'm trying to find a way to enrich an XML representation of the Python AST with mypy's type information. It seems this is not trivial. I have the following hypotheses (or doubts):

  1. There is no map (with an inverse) between the nodes of the Python AST and the nodes of the mypy AST
  2. The mypy AST has a different structure

This makes enriching the AST with types a hard task if you want to keep the original structure.

  3. It is harder to navigate through the mypy AST
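
One way to approximate the missing map is to key both sides on source positions. The sketch below uses only the standard `ast` module. `TYPE_RECORDS` is a hand-written, hypothetical stand-in for whatever (line, column, type) triples a type checker could export, and it assumes 1-based lines and 0-based columns, as `ast` uses.

```python
import ast
from collections import defaultdict

# Hypothetical checker output: (line, column, type name).
TYPE_RECORDS = [
    (1, 0, "builtins.list[builtins.int]"),
    (1, 4, "builtins.list[builtins.int]"),
    (1, 5, "builtins.int"),
]


def index_by_position(tree):
    """Map (lineno, col_offset) to the expression nodes starting there."""
    index = defaultdict(list)
    for node in ast.walk(tree):
        # Only expressions carry types; statements are skipped.
        if isinstance(node, ast.expr):
            index[(node.lineno, node.col_offset)].append(node)
    return index


tree = ast.parse("x = [42]")
index = index_by_position(tree)

annotated = []
for line, column, type_name in TYPE_RECORDS:
    for node in index.get((line, column), []):
        node.inferred_type = type_name  # attach the type to the ast node
        annotated.append((type(node).__name__, type_name))

print(annotated)
```

Start positions alone are ambiguous: in `a.b`, the `Attribute` node and the inner `Name` node begin at the same column. A more robust key would also include the end position.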

In the python ast module we can do the following

```python
import ast

# txt holds the source text, filename its path
tree = ast.parse(txt, filename)

def transform(node):
    node_fields = zip(
        node._fields, (getattr(node, attr) for attr in node._fields))
    for field_name, field_value in node_fields:
        ...  # stuff and recursion
```
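
To make the "stuff and recursion" part concrete, here is a minimal sketch that mirrors a Python AST as XML with `xml.etree.ElementTree`. The element and attribute layout is an assumption chosen for illustration, not the format of any existing tool.

```python
import ast
import xml.etree.ElementTree as ET


def to_xml(node):
    """Recursively mirror an ast node as an XML element."""
    elem = ET.Element(type(node).__name__)
    # Keep source positions so type information can be joined in later.
    for attr in ("lineno", "col_offset"):
        if hasattr(node, attr):
            elem.set(attr, str(getattr(node, attr)))
    for field_name, field_value in ast.iter_fields(node):
        if isinstance(field_value, ast.AST):
            # A single child node, wrapped in an element named after the field
            ET.SubElement(elem, field_name).append(to_xml(field_value))
        elif isinstance(field_value, list):
            child = ET.SubElement(elem, field_name)
            for item in field_value:
                if isinstance(item, ast.AST):
                    child.append(to_xml(item))
        elif field_value is not None:
            # Plain values (names, constants) become attributes
            elem.set(field_name, str(field_value))
    return elem


root = to_xml(ast.parse("x = [42]"))
print(ET.tostring(root, encoding="unicode"))
```

`ast.iter_fields` yields the same (name, value) pairs as the `zip` over `_fields` above.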

I don't know how we can have a similar implementation using the result from mypy.build


If you want to attack this problem we can talk more. Maybe we can figure out how to achieve this

@devmessias
Contributor

devmessias commented Jun 2, 2022

OK, I tried to solve this through the mypy cache. But it seems that it will not work, because we don't have information about the token positions (like pyre does) or about how the elements in the cache relate to the AST nodes or the tokens in the original source code.

This image shows how a FunctionDef node is stored in the mypy cache file
[Image: a FunctionDef entry in the mypy cache file]

Is there a method available in mypy that relates the keys in a JSON cache file to the token positions in the source file?

@devmessias
Contributor

Strange, it seems that the cache for each file doesn't store any information about the types of the variables inside each function.

rominf pushed a commit to rominf/LibCST that referenced this issue Dec 7, 2022
This change is an RFC (please read the whole change message).

Add `MypyTypeInferenceProvider` as an alternative to
`TypeInferenceProvider`. The provider infers types using mypy as a
library. The only requirement is to have the latest mypy installed.
The inferred types are mypy types: the mypy type system is well
designed, and using it directly avoids a conversion and keeps the
provider simple. For compatibility and extensibility reasons, these
types are stored in a separate field, `MypyType.mypy_type`.

Let's assume we have the following code in the file `x.py` which we want
to inspect:
```python
x = [42]

s = set()

from enum import Enum

class E(Enum):
    f = "f"

e = E.f
```

Then, to play with mypy types, one can use code like:
```python
import libcst as cst

from libcst.metadata import MypyTypeInferenceProvider

filename = "x.py"
module = cst.parse_module(open(filename).read())
cache = MypyTypeInferenceProvider.gen_cache(".", [filename])[filename]
wrapper = cst.MetadataWrapper(
    module,
    cache={MypyTypeInferenceProvider: cache},
)

mypy_type = wrapper.resolve(MypyTypeInferenceProvider)
x_name_node = wrapper.module.body[0].body[0].targets[0].target
set_call_node = wrapper.module.body[1].body[0].value
e_name_node = wrapper.module.body[-1].body[0].targets[0].target

print(mypy_type[x_name_node])
 # prints: builtins.list[builtins.int]

print(mypy_type[x_name_node].fullname)
 # prints: builtins.list[builtins.int]

print(mypy_type[x_name_node].mypy_type.type.fullname)
 # prints: builtins.list

print(mypy_type[x_name_node].mypy_type.args)
 # prints: (builtins.int,)

print(mypy_type[x_name_node].mypy_type.type.bases[0].type.fullname)
 # prints: typing.MutableSequence

print(mypy_type[set_call_node])
 # prints: builtins.set

print("issuperset" in mypy_type[set_call_node].mypy_type.names)
 # prints: True

print(mypy_type[set_call_node.func])
 # prints: typing.Type[builtins.set]

print(mypy_type[e_name_node].mypy_type.type.is_enum)
 # prints: True
```

Why?

1. `TypeInferenceProvider` requires pyre (`pyre-check` on PyPI) to be
   installed. mypy is more popular than pyre. If the organization
   already uses mypy (which is almost always the case), it may be
   difficult to convince colleagues (including the security team) that
   "we need yet another type checker". `MypyTypeInferenceProvider`
   requires only the latest mypy.
2. Even though it is possible to run pyre without installing watchman,
   this is not advertised. Installing watchman is not always possible
   because of system requirements or security requirements such as
   "we install only our favorite GNU/Linux distribution packages".
3. Using `TypeInferenceProvider` requires running `pyre start` before
   the execution and `pyre stop` after it. This may be inconvenient,
   especially when pyre was not used before.
4. Types produced by pyre in `TypeInferenceProvider` are just strings.
   For example, it is not easy to infer that some variable is an enum
   instance. `MypyTypeInferenceProvider` makes it easy, see the code
   above.

Drawbacks:

1. Speed. mypy is slower than pyre, so `MypyTypeInferenceProvider` is
   slower than `TypeInferenceProvider`.
   How to partially solve this:
   1. Implement AST caching in mypy. It may be difficult, but it would
      speed up all the projects that use this functionality.
   2. Implement caching of inferred types inside LibCST. As far as I
      know, no caching at all is implemented inside LibCST, and that is
      a prerequisite for caching inferred types, so the task is big.
   3. Implement a LibCST CST to mypy AST conversion. I am not sure if
      this is possible at all. Even if it is, the task is huge.
2. LibCST will contain two providers that do similar things. This can
   lead to a situation where two type checkers must be installed to
   get all codemods from the library running.
   Alternatives considered:
   1. Put `MypyTypeInferenceProvider` inside a separate library (say,
      LibCST-mypy, or `libcst-mypy` on PyPI). This will explicitly
      separate `MypyTypeInferenceProvider` from the rest of LibCST.
      Drawbacks:
      1. The need to maintain a separate library.
      2. Limited visibility (people need to know that the library
         exists).
      3. Some codemods cannot be implemented easily without the
         library, for example an `if-elif-else` to `match` converter
         (it needs powerful type inference). They could not be
         shipped with LibCST, which makes the latter less attractive
         for end users.
   2. Implement a base class for the inferred type that inherits from
      `str` (to keep compatibility with the existing codebase), plus a
      mechanism for dynamically selecting the `TypeInferenceProvider`
      type checker (mypy or pyre; the user can do this via an
      environment variable). If the code inside LibCST requires only
      shallow type information (so `str` is enough), then the code can
      run with any type checker. The remaining code (such as the
      `if-elif-else` to `match` converter) will still require mypy.
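
The second alternative could look roughly like the sketch below. Every
name in it is hypothetical and not part of LibCST: the `InferredType`
class, its `checker_type` attribute, the `select_checker` helper and the
`LIBCST_TYPECHECKER` environment variable.

```python
import os


class InferredType(str):
    """A type name that still behaves like a plain string."""

    def __new__(cls, fullname, checker_type=None):
        obj = super().__new__(cls, fullname)
        # Rich, checker-specific object (e.g. a mypy type), if available
        obj.checker_type = checker_type
        return obj


def select_checker(default="pyre"):
    """Pick the type checker from a (hypothetical) environment variable."""
    return os.environ.get("LIBCST_TYPECHECKER", default)


# Stand-in for a real checker object
rich = {"is_enum": False}
inferred = InferredType("builtins.list[builtins.int]", checker_type=rich)

# Code that only needs the string keeps working unchanged.
print(inferred.startswith("builtins.list"))
# Code that needs more detail can reach for the rich object.
print(inferred.checker_type)
print(select_checker())
```

Existing string comparisons keep working with such a class, while
codemods that need deeper information can inspect `checker_type` and
fall back when it is `None`.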

Misc:

The code does not lint in my environment: for some reason `pyre check`
cannot find the `mypy` library.

Related to:

* Instagram#451
* pyastrx/pyastrx#40
* python/mypy#12513
* python/mypy#4868