Skip to content

Turn isna() and notna() into TypeGuards #339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 13, 2022
Merged

Conversation

gandhis1
Copy link
Contributor

  • Tests added: Please use assert_type() to assert the type of any return value

Putting up a draft first because I actually am not quite sure what the best way to go about this is. I wrote the tests first, as to what the behavior intuitively should be, but getting the annotations to match is the challenging part.

Further, I'm still not perfectly happy with these annotations, because I have seen tons of code use pd.isnull on any random object, and the annotations currently do not support this, they only really allow pandas objects and a curated list of Scalar objects.

@overload
def isna(obj: Scalar | NAType) -> TypeGuard[NAType]: ...
@overload
def isna(obj: Scalar | None) -> TypeGuard[None]: ...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are overlapping overloads. But what's the right way to do this? If you pass a ScalarT | None, the type guard should only ever give you a None, it should never give you a NAType back.

If I remove the last 3 overloads, then you always get back a TypeGuard[NaTType | NAType | None]....which isn't correct

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, (but not sure), based on my reading of the docs, that you don't need all the overloads. If you do

def isna(
    obj:  NaTType | NAType | None,
) -> TypeGuard[NaTType | NAType | None]: ...

that tells the type checker that isna() returns True for just those types, and False for anything else.

What's not clear to me is how to use (and whether we can use) assert_type() here. Functions annotated as TypeGuard always return bool, but what will assert_type() do, both for the type checker and at runtime?

Keep experimenting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's not clear to me is how to use (and whether we can use) assert_type() here. Functions annotated as TypeGuard always return bool, but what will assert_type() do, both for the type checker and at runtime?

Could probably have a test like

if not pd.isna(None):
    some_terrible_type_error  # type: ignore[some-unused-code-error] # but not the terrible error which should not be analyzed by mypy/pyright

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the following simple test:

from typing_extensions import assert_type, TypeGuard


def isastr(s: str) -> TypeGuard[str]:
    return isinstance(s, str)


assert_type(isastr("abc"), bool)

assert_type(isastr(3.4), bool)

print(assert_type(isastr("asdf"), bool))

Both pyright and mypy indicate that isastr(3.4) is invalid.

At runtime, assert_type(isastr("asdf"), bool) returns True

So for testing purposes, we can use check(assert_type(pd.isna(foo), bool), True) where foo are None, pd.NA, pd.NaT, np.nan.

Then could do something like check(assert_type(pd.isna("abc"), bool), False) # type: ignore[arg-type] to check that we are catching invalid values.

Copy link
Contributor Author

@gandhis1 gandhis1 Sep 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To start, the part I'm not following is why we are trying to check the type of the return of the function itself. The new tests I have added to this PR don't bother doing this because that's more or less already covered by existing tests. Instead, they check the type of the object that was passed to pd.isna. Because that is the object whose type should be influenced by the TypeGuard.

The limiting factor I see right now is that yes, pd.isna has a variable return type, and TypeGuard always needs to return bool. But this seems to me a glaring gap in the capability of the type checker - we should be able to narrow types.

If I have this code, which is essentially the same as the test I added:

x = random.choice("value", None)  # Optional[str]
if pd.notnull(x):
     ...  # x is guaranteed to be a str

We know for a fact that x is a str in the body of the if statement. I am not aware of any exceptions to that. Anyway, rather than giving up here due to type checking limitations, I'm wondering if this is something worth escalating to the typing mailing list. Is the restriction that TypeGuard always returns a bool actually necessary? Or can we support @overload with TypeGuard?

The second, somewhat separate issue, is how the narrowing works. I have a test case illustrating this as well. But to copy the example, this first block of code already works as intended and passes MyPy at least:

x = random.choice("value", None, pd.NA)  # Union[str, None, pd.NA]
if pd.isnull(x):
    ...  # x is Union[None, pd.NA]

The above code works because I annotated all of the null-equivalent types in a single Union. However, if you do that, then this following example does NOT work. It should work. But it doesn't:

x = random.choice("value", None)  # Union[str, None]
if pd.isnull(x):
    ...  # x should be None, but because we had a single TypeGuard overload that covered pd.NA as well, this is actually Union[None, pd.NA]

Now the question I was asking myself, so is the solution to just have separate overloads for each? But you can't, because that overlaps with the original broader overload, and pyright complains about overlapping overloads.

So that's essentially where I left off.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To start, the part I'm not following is why we are trying to check the type of the return of the function itself.

This is the methodology we've come up with to test the stubs. Especially in the case of overloaded functions/methods, we want to make sure that we are getting the expected result for each possible input.

The limiting factor I see right now is that yes, pd.isna has a variable return type, and TypeGuard always needs to return bool. But this seems to me a glaring gap in the capability of the type checker - we should be able to narrow types.

Is it the capability of the type checker, or the capability of the spec for TypeGuard ?

We know for a fact that x is a str in the body of the if statement. I am not aware of any exceptions to that. Anyway, rather than giving up here due to type checking limitations, I'm wondering if this is something worth escalating to the typing mailing list. Is the restriction that TypeGuard always returns a bool actually necessary? Or can we support @overload with TypeGuard?

Yes, worth bringing up with the typing gods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the methodology we've come up with to test the stubs. Especially in the case of overloaded functions/methods, we want to make sure that we are getting the expected result for each possible input.

I don't think that methodology is by itself sufficient for testing a TypeGuard. TypeGuard changes the type of the object passed to it - and that's what I've been trying to test here. It's arguably the most relevant aspect to test, otherwise a TypeGuard is no better than a bool return. The existing tests, which are associated with the previous bool return, still work. To the extent that the previous tests were incomplete I can fill in the gaps. Although at this point seems like this whole thing is on hold since TypeGuard in its current state doesn't seem to work here.

Is it the capability of the type checker, or the capability of the spec for TypeGuard ?

Seems like the spec to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, worth bringing up with the typing gods.

I'm not sure how quick the typing mailing list is, but the discussion page of pyright might be a first start where you get very quickly feedback.

@gandhis1
Copy link
Contributor Author

Closing this for now, until TypeGuard is compatible with overloads, this is not viable.

@gandhis1 gandhis1 closed this Sep 29, 2022
@gandhis1
Copy link
Contributor Author

gandhis1 commented Oct 5, 2022

Reopening

@gandhis1 gandhis1 reopened this Oct 5, 2022
@gandhis1 gandhis1 marked this pull request as ready for review October 5, 2022 15:48
assert check(assert_type(pd.isna(None), Literal[True]), bool)
assert not check(assert_type(pd.notna(None), Literal[False]), bool)
assert check(assert_type(pd.isna(None), bool), bool)
assert not check(assert_type(pd.notna(None), bool), bool)

check(assert_type(pd.isna(2.5), bool), bool)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this assert not check(assert_type(pd.isna(2.5), bool), bool), and something similar on the next line.

Just want to make sure we check all the results given the use of TypeGuard

assert_type(nullable3, Union[bool, NAType, None]) # TODO: Mypy result
assert_type(
nullable3, Union[bool, NaTType, NAType, None]
) # TODO: Pyright result
Copy link
Contributor Author

@gandhis1 gandhis1 Oct 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a lot more test cases and tested the negative condition as well. I also documented what the correct type should be - which isn't always supported by the way TypeGuard currently works (see comments on StrictTypeGuard.

However, this uncovered conflicting behavior in Mypy and Pyright. My 2 cents on this, is that I believe Mypy is incorrect, and Pyright is correct. This is just an extension on how TypeGuard generally behaves - a simple type narrowing translation from A to B, with no regard for unioned types and set arithmetic, and no requirement that the type be strictly narrowed. And as long as that's true, then Mypy shouldn't be narrowing beyond that, as it does on line 203 and 183 for example.

# and as a result the type narrowing does not always work as it intuitively should
# There is a proposal being floated for a StrictTypeGuard that will have more rigid narrowing semantics
# In the test cases below, a commented out assertion will be included to document the optimal test result
nullable1 = random.choice(["value", None, pd.NA, pd.NaT])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that using random.choice is the right thing to use here. Consider this simple script:

import random

x = random.choice([1, "abc", None])
reveal_type(x)

pyright reports int | str | None whereas mypy report builtins.object

So already you have a different kind of test happening based on whether pyright or mypy is being used.

Maybe you have to add a type to nullable1 as in:

nulllable1: Union[str, None, pd.NAType, pd.NaType] = random.choice(["value", None, pd.NA, pd.NaT])

to avoid this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I do see that behavior in your example, I am assuming probably due to the two types of built-in scalars involved, however the examples I have chosen do not have this issue:

MyPy:

tests/test_pandas.py:383: note: Revealed type is "Union[pandas._libs.tslibs.nattype.NaTType, pandas._libs.missing.NAType, builtins.str, None]"
tests/test_pandas.py:385: note: Revealed type is "Union[builtins.int, None]"
tests/test_pandas.py:387: note: Revealed type is "Union[pandas._libs.missing.NAType, builtins.bool, None]"

Pyright:

  tests/test_pandas.py:383:17 - information: Type of "nullable1" is "str | NAType | NaTType | None"
  tests/test_pandas.py:385:17 - information: Type of "nullable2" is "int | None"
  tests/test_pandas.py:387:17 - information: Type of "nullable3" is "bool | NAType | None"

I can still annotate it explicitly if that makes the test a bit more clear, but the conflict between MyPy / Pyright is after this

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an explicit annotation would be best here, because my simple example shows something funky going on, and I'm concerned if we make a modification in the future, we won't catch it.

@gandhis1
Copy link
Contributor Author

gandhis1 commented Oct 7, 2022

TODO: One, I probably will have to just silence the conflicting test by commenting it out / removing it. That said, MyPy and Pyright ought to have the same behavior, so I'll bring it up on Pyright's forums first. Given that I suspect Pyright is right here, may end up posting it as a Mypy issue. In any case, whichever type checker is wrong, probably isn't going to get fixed any soon, hence why I will silence it.

# assert_type(nullable3, Union[NAType, None])
assert_type(nullable3, Union[bool, NAType, None]) # TODO: Mypy result
assert_type(
nullable3, Union[bool, NaTType, NAType, None]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently failing here because nullable3 is of type Union[bool, NAType, None]

# check(assert_type(nullable2, None), type(None))
assert_type(nullable2, Union[int, None]) # TODO: MyPy result
assert_type(
nullable2, Union[int, NaTType, NAType, None]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently failing here because nullable2 is Union[int, None] (although it should be None based on the test of pd.notna() )

@gandhis1
Copy link
Contributor Author

gandhis1 commented Oct 12, 2022

So I ended up commenting out the tests that had conflicting behavior. In summary:

  • The current spec of TypeGuard does not enable as precise type narrowing as would be ideal. This is unlikely to be remedied anytime soon, as I'd imagine this is something a PEP would be needed to address.
  • I took the step of annotating as a comment what the correct/optimal type assertion should be. These remain commented solely for documentation for now, although if and when in the future they become viable, they can be used as actual assertions.
  • There are also certain cases in which MyPy and Pyright produce conflicting behavior. These have likewise been commented out for documentation purposes. If, in the future, the two type checkers converge to the same behavior, we may want to enable these assertions.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gandhis1

Just a note. Pep 647 says the following about the negative case: "User-defined type guards apply narrowing only in the positive case (the if clause). The type is not narrowed in the negative case."

So that explains why our "ideal" test in the else clauses isn't happening.

@Dr-Irv Dr-Irv merged commit 617529a into pandas-dev:main Oct 13, 2022
@gandhis1
Copy link
Contributor Author

So I upgraded to the latest version in conda today. It's definitely an improvement as this uncovered a new legitimate type error in my code base. Unfortunately, it's not perfect due to the lack of strict type narrowing semantics, as well as lack of type narrowing in the negative case. I hope StrictTypeGuard gains some traction to improve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants