-
-
Notifications
You must be signed in to change notification settings - Fork 143
Turn isna() and notna() into TypeGuards #339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas-stubs/core/dtypes/missing.pyi
Outdated
@overload | ||
def isna(obj: Scalar | NAType) -> TypeGuard[NAType]: ... | ||
@overload | ||
def isna(obj: Scalar | None) -> TypeGuard[None]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are overlapping overloads. But what's the right way to do this? If you pass a ScalarT | None
, the type guard should only ever give you a None
, it should never give you a NAType
back.
If I remove the last 3 overloads, then you always get back a TypeGuard[NaTType | NAType | None]
....which isn't correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, (but not sure), based on my reading of the docs, that you don't need all the overloads. If you do
def isna(
obj: NaTType | NAType | None,
) -> TypeGuard[NaTType | NAType | None]: ...
that tells the type checker that isna()
returns True
for just those types, and False
for anything else.
What's not clear to me is how to use (and whether we can use) assert_type()
here. Functions annotated as TypeGuard
always return bool
, but what will assert_type()
do, both for the type checker and at runtime?
Keep experimenting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's not clear to me is how to use (and whether we can use)
assert_type()
here. Functions annotated asTypeGuard
always returnbool
, but what willassert_type()
do, both for the type checker and at runtime?
Could probably have a test like
if not pd.isna(None):
some_terrible_type_error # type: ignore[some-unused-code-error] # but not the terrible error which should not be analyzed by mypy/pyright
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the following simple test:
from typing_extensions import assert_type, TypeGuard
def isastr(s: str) -> TypeGuard[str]:
return isinstance(s, str)
assert_type(isastr("abc"), bool)
assert_type(isastr(3.4), bool)
print(assert_type(isastr("asdf"), bool))
Both pyright
and mypy
indicate that isastr(3.4)
is invalid.
At runtime, assert_type(isastr("asdf"), bool)
returns True
So for testing purposes, we can use check(assert_type(pd.isna(foo), bool), True)
where foo
are None
, pd.NA
, pd.NaT
, np.nan
.
Then could do something like check(assert_type(pd.isna("abc"), bool), False) # type: ignore[arg-type]
to check that we are catching invalid values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To start, the part I'm not following is why we are trying to check the type of the return of the function itself. The new tests I have added to this PR don't bother doing this because that's more or less already covered by existing tests. Instead, they check the type of the object that was passed to pd.isna
. Because that is the object whose type should be influenced by the TypeGuard
.
The limiting factor I see right now is that yes, pd.isna
has a variable return type, and TypeGuard
always needs to return bool
. But this seems to me a glaring gap in the capability of the type checker - we should be able to narrow types.
If I have this code, which is essentially the same as the test I added:
x = random.choice("value", None) # Optional[str]
if pd.notnull(x):
... # x is guaranteed to be a str
We know for a fact that x is a str
in the body of the if
statement. I am not aware of any exceptions to that. Anyway, rather than giving up here due to type checking limitations, I'm wondering if this is something worth escalating to the typing mailing list. Is the restriction that TypeGuard
always returns a bool
actually necessary? Or can we support @overload
with TypeGuard
?
The second, somewhat separate issue, is how the narrowing works. I have a test case illustrating this as well. But to copy the example, this first block of code already works as intended and passes MyPy at least:
x = random.choice("value", None, pd.NA) # Union[str, None, pd.NA]
if pd.isnull(x):
... # x is Union[None, pd.NA]
The above code works because I annotated all of the null-equivalent types in a single Union
. However, if you do that, then this following example does NOT work. It should work. But it doesn't:
x = random.choice("value", None) # Union[str, None]
if pd.isnull(x):
... # x should be None, but because we had a single TypeGuard overload that covered pd.NA as well, this is actually Union[None, pd.NA]
Now the question I was asking myself, so is the solution to just have separate overloads for each? But you can't, because that overlaps with the original broader overload, and pyright complains about overlapping overloads.
So that's essentially where I left off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To start, the part I'm not following is why we are trying to check the type of the return of the function itself.
This is the methodology we've come up with to test the stubs. Especially in the case of overloaded functions/methods, we want to make sure that we are getting the expected result for each possible input.
The limiting factor I see right now is that yes,
pd.isna
has a variable return type, andTypeGuard
always needs to returnbool
. But this seems to me a glaring gap in the capability of the type checker - we should be able to narrow types.
Is it the capability of the type checker, or the capability of the spec for TypeGuard
?
We know for a fact that x is a
str
in the body of theif
statement. I am not aware of any exceptions to that. Anyway, rather than giving up here due to type checking limitations, I'm wondering if this is something worth escalating to the typing mailing list. Is the restriction thatTypeGuard
always returns abool
actually necessary? Or can we support@overload
withTypeGuard
?
Yes, worth bringing up with the typing gods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the methodology we've come up with to test the stubs. Especially in the case of overloaded functions/methods, we want to make sure that we are getting the expected result for each possible input.
I don't think that methodology is by itself sufficient for testing a TypeGuard
. TypeGuard
changes the type of the object passed to it - and that's what I've been trying to test here. It's arguably the most relevant aspect to test, otherwise a TypeGuard
is no better than a bool
return. The existing tests, which are associated with the previous bool
return, still work. To the extent that the previous tests were incomplete I can fill in the gaps. Although at this point seems like this whole thing is on hold since TypeGuard
in its current state doesn't seem to work here.
Is it the capability of the type checker, or the capability of the spec for TypeGuard ?
Seems like the spec to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, worth bringing up with the typing gods.
I'm not sure how quick the typing mailing list is, but the discussion page of pyright might be a first start where you get very quickly feedback.
Closing this for now, until |
Reopening |
tests/test_pandas.py
Outdated
assert check(assert_type(pd.isna(None), Literal[True]), bool) | ||
assert not check(assert_type(pd.notna(None), Literal[False]), bool) | ||
assert check(assert_type(pd.isna(None), bool), bool) | ||
assert not check(assert_type(pd.notna(None), bool), bool) | ||
|
||
check(assert_type(pd.isna(2.5), bool), bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this assert not check(assert_type(pd.isna(2.5), bool), bool)
, and something similar on the next line.
Just want to make sure we check all the results given the use of TypeGuard
tests/test_pandas.py
Outdated
assert_type(nullable3, Union[bool, NAType, None]) # TODO: Mypy result | ||
assert_type( | ||
nullable3, Union[bool, NaTType, NAType, None] | ||
) # TODO: Pyright result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a lot more test cases and tested the negative condition as well. I also documented what the correct type should be - which isn't always supported by the way TypeGuard
currently works (see comments on StrictTypeGuard
.
However, this uncovered conflicting behavior in Mypy and Pyright. My 2 cents on this, is that I believe Mypy is incorrect, and Pyright is correct. This is just an extension on how TypeGuard generally behaves - a simple type narrowing translation from A to B, with no regard for unioned types and set arithmetic, and no requirement that the type be strictly narrowed. And as long as that's true, then Mypy shouldn't be narrowing beyond that, as it does on line 203 and 183 for example.
tests/test_pandas.py
Outdated
# and as a result the type narrowing does not always work as it intuitively should | ||
# There is a proposal being floated for a StrictTypeGuard that will have more rigid narrowing semantics | ||
# In the test cases below, a commented out assertion will be included to document the optimal test result | ||
nullable1 = random.choice(["value", None, pd.NA, pd.NaT]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that using random.choice
is the right thing to use here. Consider this simple script:
import random
x = random.choice([1, "abc", None])
reveal_type(x)
pyright reports int | str | None
whereas mypy report builtins.object
So already you have a different kind of test happening based on whether pyright or mypy is being used.
Maybe you have to add a type to nullable1
as in:
nulllable1: Union[str, None, pd.NAType, pd.NaType] = random.choice(["value", None, pd.NA, pd.NaT])
to avoid this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I do see that behavior in your example, I am assuming probably due to the two types of built-in scalars involved, however the examples I have chosen do not have this issue:
MyPy:
tests/test_pandas.py:383: note: Revealed type is "Union[pandas._libs.tslibs.nattype.NaTType, pandas._libs.missing.NAType, builtins.str, None]"
tests/test_pandas.py:385: note: Revealed type is "Union[builtins.int, None]"
tests/test_pandas.py:387: note: Revealed type is "Union[pandas._libs.missing.NAType, builtins.bool, None]"
Pyright:
tests/test_pandas.py:383:17 - information: Type of "nullable1" is "str | NAType | NaTType | None"
tests/test_pandas.py:385:17 - information: Type of "nullable2" is "int | None"
tests/test_pandas.py:387:17 - information: Type of "nullable3" is "bool | NAType | None"
I can still annotate it explicitly if that makes the test a bit more clear, but the conflict between MyPy / Pyright is after this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an explicit annotation would be best here, because my simple example shows something funky going on, and I'm concerned if we make a modification in the future, we won't catch it.
TODO: One, I probably will have to just silence the conflicting test by commenting it out / removing it. That said, MyPy and Pyright ought to have the same behavior, so I'll bring it up on Pyright's forums first. Given that I suspect Pyright is right here, may end up posting it as a Mypy issue. In any case, whichever type checker is wrong, probably isn't going to get fixed any soon, hence why I will silence it. |
tests/test_pandas.py
Outdated
# assert_type(nullable3, Union[NAType, None]) | ||
assert_type(nullable3, Union[bool, NAType, None]) # TODO: Mypy result | ||
assert_type( | ||
nullable3, Union[bool, NaTType, NAType, None] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently failing here because nullable3
is of type Union[bool, NAType, None]
tests/test_pandas.py
Outdated
# check(assert_type(nullable2, None), type(None)) | ||
assert_type(nullable2, Union[int, None]) # TODO: MyPy result | ||
assert_type( | ||
nullable2, Union[int, NaTType, NAType, None] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently failing here because nullable2
is Union[int, None]
(although it should be None
based on the test of pd.notna()
)
…yright Keep the tests as comments in order to document the behavior for future reference
So I ended up commenting out the tests that had conflicting behavior. In summary:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gandhis1
Just a note. Pep 647 says the following about the negative case: "User-defined type guards apply narrowing only in the positive case (the if clause). The type is not narrowed in the negative case."
So that explains why our "ideal" test in the else
clauses isn't happening.
So I upgraded to the latest version in |
assert_type()
to assert the type of any return valuePutting up a draft first because I actually am not quite sure what the best way to go about this is. I wrote the tests first, as to what the behavior intuitively should be, but getting the annotations to match is the challenging part.
Further, I'm still not perfectly happy with these annotations, because I have seen tons of code use
pd.isnull
on any random object, and the annotations currently do not support this, they only really allowpandas
objects and a curated list ofScalar
objects.