Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple] #706

ClementPinard · 2023-05-23T19:43:48Z

Describe the bug

When using a Series in a groupby, the byT type is supposed to be tuple because Series is included in the TypeVar GroupByObjectNonScalar, but since there is only one Series, it should not be a tuple, but rather the type of the Series we use in the groupby.

See how the GroupByObjectNonScalar is declared here :

pandas-stubs/pandas-stubs/_typing.pyi

Line 435 in 70c412c

| Series

(note how it includes both Series and list[Series], same goes with Function or Grouper probably.

See how the resulting groupby is decided to be SeriesGroupBy[S1, tuple] here when by is either a MultiIndex or a GroupByObjectNonScalar :

pandas-stubs/pandas-stubs/core/series.pyi

Line 627 in 70c412c

def groupby(

It seems to me that the GroupByObjectNonScalar is a bit too broad and includes both objects representing only one grouper and multiple groupers (where the SeriesGroupBy[S1, tuple] is consistent). So maybe we should divide this TypeVar in two and change the groupby overloads so that when only one grouper is present, the type is Any ?

_typing.pyi

MultiGroupByObjectNonScalar: TypeAlias = (
    tuple
    | list[_HashableTa]
    | list[Function]
    | list[Series]
    | np.ndarray 
    | list[np.ndarray]
    | list[Mapping[Label, Any]]
    | list[Index]
    | list[Grouper]
)
MonoGroupByObjectNonScalar: TypeAlias = (
    Function
    | Series
    | Mapping[Label, Any]
    | Grouper
)

core/series.pyi

    @overload
    def groupby(
        self,
        by: MultiIndex | MultiGroupByObjectNonScalar = ...,
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, tuple]: ...
    @overload
    def groupby(
        self,
        by: CategoricalIndex | Index | MonoGroupByObjectNonScalar,
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, Any]: ...

Note that I did not include the np.ndarray object in MonoGroupByObjectNonScalar since it can be 2D. Not sure how this one should be treated.

Of course this proposition is just an illustration, I suppose such division would call for many other modification elsewhere, especially within core/frame.pyi and that there might be better solution.

To Reproduce

The following example shows that the type of output is deduced to be list[tuple[tuple, Series]] while it should be list[tuple[bool, Series]]

import pandas as pd
import numpy as np
from typing import reveal_type

series = pd.Series(np.random.randn(10))
above_0 = series > 0

output = list(series.groupby(above_0))

print(output)
reveal_type(output)

output when running with python :

[(False, 0   -0.666977
2   -1.586030
3   -0.862037
5   -0.617085
6   -1.553402
8   -0.694892
9   -0.492148
dtype: float64), (True, 1    0.848925
4    0.603415
7    0.938546
dtype: float64)]

Output when running with pyright

example.py:11:13 - information: Type of "output" is "list[tuple[tuple[Unknown, ...], Series[Unknown]]]"

This happens because the list function calls the __iter__ method which is typed here :

pandas-stubs/pandas-stubs/core/groupby/generic.pyi

Line 143 in 70c412c

def __iter__(self) -> Iterator[tuple[ByT, Series[S1]]]: ...

Please complete the following information:

OS: Ubuntu
OS 22.04
python version : 3.10
version of type checker : pyright v1.1.298
version of installed pandas-stubs : 2.0.1.230501

The text was updated successfully, but these errors were encountered:

Dr-Irv · 2023-05-23T21:31:35Z

I'm open to a PR that would address this.

ClementPinard · 2023-05-24T07:42:16Z

Thank you for answering, I might try a take on that.

Quick question, as I am not a pro on generic and typevars (yet 😄 ).

Is this valid ?

    @overload
    def groupby(
        self,
        by: Series[S1],
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, S1]: ...

I suppose it is but only if we want to groupby by a Series with the dtype as the first Series. Would it be possible then to have another typevar similar to S1, but just not correlated ? Would I need to create this new S2 TypeVar somewhere ?

@overload
    def groupby(
        self,
        by: Series[S2],
        axis: AxisIndex = ...,
        level: Level | None = ...,
        as_index: _bool = ...,
        sort: _bool = ...,
        group_keys: _bool = ...,
        squeeze: _bool = ...,
        observed: _bool = ...,
        dropna: _bool = ...,
    ) -> SeriesGroupBy[S1, S2]: ...

Dr-Irv · 2023-05-24T14:35:20Z

If you want Series[S1].groupby() to allow a Series[S2] as an argument, where S1 and S2 are really covering the same types, but you want the result of SeriesGroupBy[S1, S2], then you would need to duplicate the TypeVar of S1 and define S2. We had to do that with HashableT# .

Note: I'm pretty sure this is what you'd need to do to get it to work, but not 100% sure.

Dr-Irv added the Groupby label May 23, 2023

ClementPinard mentioned this issue May 24, 2023

add proper type when grouping by a Series #708

Merged

2 tasks

Dr-Irv closed this as completed in #708 Jun 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple] #706

Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple] #706

ClementPinard commented May 23, 2023

Dr-Irv commented May 23, 2023

Uh oh!

ClementPinard commented May 24, 2023

Uh oh!

Dr-Irv commented May 24, 2023

Uh oh!

Uh oh!

Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple] #706

Series.groupby(by: Series) should not return SeriesGroupBy[S1, tuple] #706

Comments

ClementPinard commented May 23, 2023

Dr-Irv commented May 23, 2023

Uh oh!

ClementPinard commented May 24, 2023

Uh oh!

Dr-Irv commented May 24, 2023

Uh oh!