add proper type when grouping by a Series #708

Merged
Dr-Irv merged 6 commits into pandas-dev:main from feature/groupby_series
Jun 2, 2023

Conversation

ClementPinard
Contributor

@ClementPinard ClementPinard commented May 24, 2023

Work in Progress, No test yet

I added a new TypeVar that is the intersection of S1 and ByT.
S1 types outside of ByT are no longer covered by the type hint, but I suppose grouping by them would fail in pandas anyway, hence their absence from ByT.
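
To make the idea concrete, here is a minimal, stdlib-only sketch of a group key type flowing from the `by` argument into the iteration result. `ToySeries`, `ToyGroupBy`, and the short constraint list are hypothetical stand-ins, not the actual pandas-stubs definitions, which cover many more types.

```python
from __future__ import annotations

from typing import Generic, Iterator, TypeVar

# Hypothetical stand-ins: the real S1 and SeriesByT in pandas-stubs differ.
S1 = TypeVar("S1")  # element type of a series
SeriesByT = TypeVar("SeriesByT", str, bool, int)  # element types usable as group keys


class ToySeries(Generic[S1]):
    def __init__(self, values: list[S1]) -> None:
        self.values = values

    def groupby(self, by: ToySeries[SeriesByT]) -> ToyGroupBy[S1, SeriesByT]:
        # The key type of the result comes from the element type of `by`.
        return ToyGroupBy(self, by)


class ToyGroupBy(Generic[S1, SeriesByT]):
    def __init__(self, obj: ToySeries[S1], by: ToySeries[SeriesByT]) -> None:
        self.obj = obj
        self.by = by

    def __iter__(self) -> Iterator[tuple[SeriesByT, ToySeries[S1]]]:
        # Collect values per key, then yield groups in sorted key order.
        groups: dict[SeriesByT, list[S1]] = {}
        for key, value in zip(self.by.values, self.obj.values):
            groups.setdefault(key, []).append(value)
        for key in sorted(groups):
            yield key, ToySeries(groups[key])


s = ToySeries([1, 1, 2])
by = ToySeries([True, True, False])
first_key, first_group = next(iter(s.groupby(by)))
print(first_key, first_group.values)  # False [2]
```

With this shape, a type checker can report the first tuple element as `bool` rather than a generic tuple, which is the behaviour the linked issue asks for.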

@ClementPinard
Contributor Author

Just uploaded tests for iteration over groupby objects.
Note that the tests are failing now, probably because the type checker fails to identify a Series[bool] and thus the corresponding type. Would love some guidance on this.

@twoertwein
Member

Would love some guidance over this.

It looks as if it should work :) I would start by adding reveal_type(...) on each line (and breaking complex statements up) to see which types mypy and pyright infer. The CI stopped after the mypy failure; locally, you can run poe pyright to see whether pyright is happy.

@ClementPinard
Contributor Author

My vscode was happy so I guess pyright was happy.

However, mypy does fail locally too.

I'll do some additional investigation.

@ClementPinard
Contributor Author

Interesting results with reveal_type:

def test_types_groupby_iter() -> None:
    s = pd.Series([1, 1, 2])
    series_groupby: pd.Series[bool] = pd.Series([True, True, False])
    first_group = next(iter(s.groupby(series_groupby)))
    reveal_type(s.groupby(series_groupby))
    reveal_type(s.groupby(series_groupby).__iter__())
    reveal_type(iter(s.groupby(series_groupby)))

Outputs

tests/test_series.py:738: note: Revealed type is "pandas.core.groupby.generic.SeriesGroupBy[Any, builtins.bool]"
tests/test_series.py:739: note: Revealed type is "typing.Iterator[Tuple[builtins.bool, pandas.core.series.Series[Any]]]"
tests/test_series.py:740: note: Revealed type is "typing.Iterator[Any]"

How can the type of iter(something) be different from something.__iter__ ? Is there some double check I should do ?

Even more puzzling is that the type is right for DataFrames:

def test_types_groupby_iter() -> None:
    df = pd.DataFrame(data={"col1": [1, 1, 2], "col2": [3, 4, 5]})
    series_groupby: pd.Series[bool] = pd.Series([True, True, False])
    first_group = next(iter(df.groupby(series_groupby)))
    reveal_type(df.groupby(series_groupby))
    reveal_type(df.groupby(series_groupby).__iter__())
    reveal_type(iter(df.groupby(series_groupby)))

output

tests/test_frame.py:992: note: Revealed type is "pandas.core.groupby.generic.DataFrameGroupBy[builtins.bool]"
tests/test_frame.py:993: note: Revealed type is "typing.Iterator[Tuple[builtins.bool, pandas.core.frame.DataFrame]]"
tests/test_frame.py:994: note: Revealed type is "typing.Iterator[Tuple[builtins.bool, pandas.core.frame.DataFrame]]"

But I don't see any additional type hint on DataFrameGroupBy regarding iteration compared to SeriesGroupBy.

@Dr-Irv
Collaborator

Dr-Irv commented May 24, 2023

How can the type of iter(something) be different from something.__iter__ ? Is there some double check I should do ?

I think the issue is that when you are doing s.groupby(series_groupby), s is Series[Any] because there was no dtype information supplied. If you had created s=pd.Series([1, 1, 2], dtype=int), then you'd get the results you expect.
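
A minimal sketch of the suggestion above, with explicit dtypes on both the grouped Series and the grouping Series. The runtime result is the same with or without dtype; the assumption here is that, with pandas-stubs installed, the explicit dtype lets a type checker see `Series[int]` and `Series[bool]` instead of `Series[Any]`.

```python
import pandas as pd

# Explicit dtypes, as suggested in the comment above.
s = pd.Series([1, 1, 2], dtype=int)
series_groupby = pd.Series([True, True, False], dtype=bool)

# groupby sorts group keys by default, so the False group comes first.
first_key, first_group = next(iter(s.groupby(series_groupby)))
print(first_key, first_group.tolist())  # False [2]
```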

NOTE - my ability to help out will be pretty limited between now (5/24) and 6/4 due to vacation.

@ClementPinard
Contributor Author

You are right, mypy now works!

I did not know that the dtype helps the type checker.

@twoertwein
Member

NOTE - my ability to help out will be pretty limited between now (5/24) and 6/4 due to vacation.

I will merge low-risk PRs during that timeframe.

@gandhis1
Contributor

How can the type of iter(something) be different from something.__iter__ ? Is there some double check I should do ?

I think the issue is that when you are doing s.groupby(series_groupby), s is Series[Any] because there was no dtype information supplied. If you had created s=pd.Series([1, 1, 2], dtype=int), then you'd get the results you expect.

NOTE - my ability to help out will be pretty limited between now (5/24) and 6/4 due to vacation.

Maybe this is deserving of a separate issue, but shouldn't we be able to create an initializer overload that takes a list[T] and returns a pd.Series[T]?

@Dr-Irv
Collaborator

Dr-Irv commented May 25, 2023

Maybe this is deserving of a separate issue, but shouldn't we be able to create an initializer overload that takes a list[T] and returns a pd.Series[T]?

Can you create a separate issue? I think this might work.

Collaborator

@Dr-Irv Dr-Irv left a comment

A couple of small things from me, plus please add a comment about SeriesByT.

Comment on lines 993 to 997
assert_type(first_group[0], "bool"),
bool,
)
check(
assert_type(first_group[1], "pd.DataFrame"),
Collaborator

You don't need quotes around the types in the two assert_type() statements, because those are valid types at runtime.

series_groupby = pd.Series([True, True, False], dtype=bool)
first_group = next(iter(s.groupby(series_groupby)))
check(
assert_type(first_group[0], "bool"),
Collaborator

no quotes needed here

Comment on lines 743 to 744
assert_type(first_group[1], "pd.Series[int]"),
pd.Series,
Collaborator

Suggested change
- assert_type(first_group[1], "pd.Series[int]"),
- pd.Series,
+ assert_type(first_group[1], "pd.Series[int]"),
+ pd.Series, np.integer

You do need quotes here because pd.Series[int] is not a type at runtime.
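
The quoting rule can be illustrated at runtime with a stdlib-only sketch. `Box` is a hypothetical class standing in for one that is generic only in its stubs, and the fallback `assert_type` mirrors the runtime behaviour of `typing.assert_type` (return the value unchanged) for Python versions before 3.11. Because this toy has no stub file, a type checker would not accept `"Box[int]"` here; the sketch only shows why the unquoted form fails when the code is executed.

```python
try:
    from typing import assert_type  # available from Python 3.11
except ImportError:
    def assert_type(val, typ):
        # Same runtime behaviour as typing.assert_type: return val unchanged.
        return val


class Box:
    """Hypothetical class that is not subscriptable at runtime."""

    def __init__(self, value):
        self.value = value


box = Box(1)

# Evaluating Box[int] eagerly raises TypeError, so it cannot be passed unquoted.
try:
    Box[int]
    subscript_ok = True
except TypeError:
    subscript_ok = False

# Quoted, the type is just a string at runtime and is never evaluated.
same_box = assert_type(box, "Box[int]")

# bool exists at runtime, so no quotes are needed.
flag = assert_type(True, bool)
```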

@ClementPinard
Contributor Author

Thank you for the review. I believe I have addressed the comments.

Collaborator

@Dr-Irv Dr-Irv left a comment

Because of a recent change to the definition of S1, can you merge with main and address the comment below? If the tests pass, we should be good to go.

# Essentially, an intersection between Series S1 TypeVar, and ByT TypeVar
SeriesByT = TypeVar(
"SeriesByT",
str,
Collaborator

We've recently been advised to no longer use constrained TypeVars, so can you change this to SeriesByT = TypeVar("SeriesByT", bound=str | bytes | ...)? We made the same change for S1 in a recent PR.

Contributor Author

Hi,

What is the difference between TypeVar("SeriesByT", bound=str | bytes | ...) and TypeVar("SeriesByT", str, bytes, ...)?

I tried to replace the SeriesByT definition with the bound keyword, but then mypy no longer passed:

Poe => mypy

===========================================
Beginning: 'Run mypy on 'tests' (using the local stubs) and on the local stubs'
===========================================

pandas-stubs/core/series.pyi:648: error: Type variable "SeriesByT" not valid as type argument value for "SeriesGroupBy"  [type-var]
pandas-stubs/core/frame.pyi:1100: error: Type variable "SeriesByT" not valid as type argument value for "DataFrameGroupBy"  [type-var]
Found 2 errors in 2 files (checked 224 source files)

===========================================
Step: 'Run mypy on 'tests' (using the local stubs) and on the local stubs' failed!
===========================================

Member

bound allows the type to be a union of all the listed types. Without it, the type has to be one of the listed (sub-)types.

You probably also need to adjust the type used for SeriesGroupBy and DataFrameGroupBy to use bound.
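
The distinction described above can be sketched with two small TypeVars. `ConstrainedT`, `BoundT`, and the helper functions are hypothetical names for illustration, not the pandas-stubs definitions; `Union[str, bytes]` is used instead of `str | bytes` only so the snippet also runs on older Python versions. The comments about acceptance and rejection describe what a type checker such as mypy is expected to report, since nothing is enforced at runtime.

```python
from __future__ import annotations

from typing import TypeVar, Union

# Constrained: the variable must solve to exactly str or exactly bytes.
ConstrainedT = TypeVar("ConstrainedT", str, bytes)
# Bound: the variable may solve to any subtype of the union, including the union itself.
BoundT = TypeVar("BoundT", bound=Union[str, bytes])


def first_constrained(items: list[ConstrainedT]) -> ConstrainedT:
    return items[0]


def first_bound(items: list[BoundT]) -> BoundT:
    return items[0]


mixed: list[Union[str, bytes]] = ["a", b"b"]

# Expected to type-check: BoundT can be solved as str | bytes.
value = first_bound(mixed)

# Expected to be rejected by a type checker: list[str | bytes] is neither
# list[str] nor list[bytes]. It still runs, because Python ignores the hints.
other = first_constrained(mixed)
```

This is also one plausible reading of the mypy errors quoted earlier: a type variable bound to a union may not be accepted as the argument for a class parameter that is still constrained, which is why the group-by classes needed the same adjustment.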

Contributor Author

Thanks, I tried that and it got rid of the errors.

Collaborator

@Dr-Irv Dr-Irv left a comment

@Dr-Irv Dr-Irv merged commit d873a46 into pandas-dev:main Jun 2, 2023
@ClementPinard ClementPinard deleted the feature/groupby_series branch June 2, 2023 12:47
Successfully merging this pull request may close these issues.

Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple]
4 participants