Skip to content

Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple] #706

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ClementPinard opened this issue May 23, 2023 · 3 comments · Fixed by #708
Closed

Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple] #706

ClementPinard opened this issue May 23, 2023 · 3 comments · Fixed by #708
Labels

Comments

@ClementPinard
Copy link
Contributor

Describe the bug

When using a Series in a groupby, the byT type is supposed to be tuple because Series is included in the TypeVar GroupByObjectNonScalar, but since there is only one Series, it should not be a tuple, but rather the type of the Series we use in the groupby.

See how the GroupByObjectNonScalar is declared here :

(note how it includes both Series and list[Series], same goes with Function or Grouper probably.

See how the resulting groupby is decided to be SeriesGroupBy[S1, tuple] here when by is either a MultiIndex or a GroupByObjectNonScalar :

It seems to me that the GroupByObjectNonScalar is a bit too broad and includes both objects representing only one grouper and multiple groupers (where the SeriesGroupBy[S1, tuple] is consistent). So maybe we should divide this TypeVar in two and change the groupby overloads so that when only one grouper is present, the type is Any ?

_typing.pyi

MultiGroupByObjectNonScalar: TypeAlias = (
    tuple
    | list[_HashableTa]
    | list[Function]
    | list[Series]
    | np.ndarray 
    | list[np.ndarray]
    | list[Mapping[Label, Any]]
    | list[Index]
    | list[Grouper]
)
MonoGroupByObjectNonScalar: TypeAlias = (
    Function
    | Series
    | Mapping[Label, Any]
    | Grouper
)

core/series.pyi

    @overload
    def groupby(
        self,
        by: MultiIndex | MultiGroupByObjectNonScalar = ...,
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, tuple]: ...
    @overload
    def groupby(
        self,
        by: CategoricalIndex | Index | MonoGroupByObjectNonScalar,
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, Any]: ...

Note that I did not include the np.ndarray object in MonoGroupByObjectNonScalar since it can be 2D. Not sure how this one should be treated.

Of course this proposition is just an illustration, I suppose such division would call for many other modification elsewhere, especially within core/frame.pyi and that there might be better solution.

To Reproduce

The following example shows that the type of output is deduced to be list[tuple[tuple, Series]] while it should be list[tuple[bool, Series]]

import pandas as pd
import numpy as np
from typing import reveal_type

series = pd.Series(np.random.randn(10))
above_0 = series > 0

output = list(series.groupby(above_0))

print(output)
reveal_type(output)

output when running with python :

[(False, 0   -0.666977
2   -1.586030
3   -0.862037
5   -0.617085
6   -1.553402
8   -0.694892
9   -0.492148
dtype: float64), (True, 1    0.848925
4    0.603415
7    0.938546
dtype: float64)]

Output when running with pyright

example.py:11:13 - information: Type of "output" is "list[tuple[tuple[Unknown, ...], Series[Unknown]]]"

This happens because the list function calls the __iter__ method which is typed here :

def __iter__(self) -> Iterator[tuple[ByT, Series[S1]]]: ...

Please complete the following information:

  • OS: Ubuntu
  • OS 22.04
  • python version : 3.10
  • version of type checker : pyright v1.1.298
  • version of installed pandas-stubs : 2.0.1.230501
@Dr-Irv
Copy link
Collaborator

Dr-Irv commented May 23, 2023

I'm open to a PR that would address this.

@Dr-Irv Dr-Irv added the Groupby label May 23, 2023
@ClementPinard
Copy link
Contributor Author

Thank you for answering, I might try a take on that.

Quick question, as I am not a pro on generic and typevars (yet 😄 ).

Is this valid ?

    @overload
    def groupby(
        self,
        by: Series[S1],
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, S1]: ...

I suppose it is but only if we want to groupby by a Series with the dtype as the first Series. Would it be possible then to have another typevar similar to S1, but just not correlated ? Would I need to create this new S2 TypeVar somewhere ?

@overload
    def groupby(
        self,
        by: Series[S2],
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, S2]: ...

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented May 24, 2023

If you want Series[S1].groupby() to allow a Series[S2] as an argument, where S1 and S2 are really covering the same types, but you want the result of SeriesGroupBy[S1, S2], then you would need to duplicate the TypeVar of S1 and define S2. We had to do that with HashableT# .

Note: I'm pretty sure this is what you'd need to do to get it to work, but not 100% sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants