ENH: Allow empty groupby #55068

elashrry · 2023-09-08T16:10:21Z

closes ENH: Allow to group by an empty list #35366
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I defined the keys as a dummy list in case the user passes an empty list []. I added tests with empty groupings and ran them with the command pytest -k "groupby and not arrow". Please let me know if it works.

mroeschke

I think I agree with #35366 (comment) as of now and think an "empty groupby" should be already doable by just calling the associated DataFrame/Series method

elashrry · 2023-09-09T08:48:49Z

There are probably two arguments against using the aggregation functions of dataframes and series:

Their output is different from the output of aggregation after groupby.

>>> df = DataFrame(
    {
        "group": ["A", "B", "A"],
        "col1": np.arange(3),
        "col2": np.arange(3, 6),
        "col3": np.arange(6, 9),
    }
)
>>> df
  group  col1  col2  col3
0     A     0     3     6
1     B     1     4     7
2     A     2     5     8
>>> df.agg(
    col1_min=("col1", "min"),
    col2_min=("col2", "sum"),
)
          col1  col2
col1_min   0.0   NaN
col2_min   NaN  12.0
>>> df.groupby([0, 0, 0]).agg(
    col1_min=("col1", "min"),
    col2_min=("col2", "sum"),
)
   col1_min  col2_min
0         0        12

Even if the output were the same as aggregation after groupby, accepting empty keys would still be a nice feature: automated code in which the list of group keys may be empty could behave uniformly, instead of working around that case with if statements.

Your decision could close this PR and the mentioned issue. We can:

1. Keep things as they are.
2. Change the behaviour of the agg function of dataframes and series to match that of aggregation after groupby and, maybe or maybe not, make groupby's aggregation methods fall back to those of dataframes and series when the user passes empty keys.
3. Accept empty keys for groupby and keep the dataframe and series aggregation unchanged. (this PR)

I prefer option 2, but I don't know the reason behind this particular shape of the dataframe aggregation output. If you want to keep it unchanged, or you think it would take time to change, then I am strongly for option 3: being able to pass empty keys to groupby would be a nice feature to have.

elashrry · 2023-09-09T09:55:50Z

I just realised that the "empty list" provided by the user could be of different types (list, tuple, NumPy array, etc.), so I can create a function, say if_empty(keys), that checks whether the user passed empty keys and accounts for the possible data types, instead of the condition (isinstance(keys, list) and keys == []), which only works with lists.
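A minimal sketch of what such a check could look like. The name if_empty comes from the comment above, but the body is only an assumption about how it might be written (it is not the code in this PR); it relies on pd.api.types.is_list_like to separate list-likes from scalar labels.

```python
import numpy as np
import pandas as pd


def if_empty(keys) -> bool:
    """Hypothetical helper: True if `keys` is an empty, sized list-like."""
    # Scalars (e.g. a column label), callables and None are not "empty keys".
    if keys is None or not pd.api.types.is_list_like(keys):
        return False
    try:
        # Works for list, tuple, ndarray, Index, Series, dict, ...
        return len(keys) == 0
    except TypeError:
        # Unsized iterables (e.g. generators) cannot be checked
        # without consuming them, so they are treated as non-empty.
        return False


print(if_empty([]), if_empty(()), if_empty(np.array([])), if_empty("group"))
```

A string label such as "group" is deliberately not treated as list-like here, so grouping by a single column name is unaffected.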

There is also a problem with using integer zeros for the dummy list: they are treated as int64 on macOS and as int32 on Windows. I don't know the best way to handle that.

Please let me know your thoughts, so I can continue or stop working on the PR.

rhshadrach · 2023-09-09T13:02:32Z

I also agree on not supporting grouping by empty lists. If users really want to do this, they can:

if len(keys) == 0:
    keys = np.zeros(len(df))
result = df.groupby(keys)...

It does look like there may be some opportunities for improving named aggregation on DataFrames.

There is also a problem with using integer zeros for the dummy list: they are treated as int64 on macOS and as int32 on Windows. I don't know the best way to handle that.

This will be fixed in NumPy 2.0. For now, you could use np.int64(0) if you want consistent integers.
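For illustration, the user-side workaround and the explicit int64 suggestion could be combined roughly as follows. The wrapper name groupby_allow_empty is hypothetical and not part of pandas; the sketch assumes keys is either a sized list-like or something groupby already accepts.

```python
import numpy as np
import pandas as pd


def groupby_allow_empty(df: pd.DataFrame, keys):
    """Hypothetical wrapper: treat empty keys as one group holding all rows."""
    if pd.api.types.is_list_like(keys) and len(keys) == 0:
        # An explicit int64 dtype keeps the result index dtype
        # the same on every platform.
        keys = np.zeros(len(df), dtype=np.int64)
    return df.groupby(keys)


df = pd.DataFrame(
    {
        "group": ["A", "B", "A"],
        "col1": np.arange(3),
        "col2": np.arange(3, 6),
    }
)

# Non-empty keys behave exactly like a plain groupby.
per_group = groupby_allow_empty(df, ["group"]).agg(col1_min=("col1", "min"))

# Empty keys aggregate over all rows, with a single group labelled 0.
overall = groupby_allow_empty(df, []).agg(
    col1_min=("col1", "min"),
    col2_sum=("col2", "sum"),
)
print(overall)
```

This keeps the empty-key handling in user code, where the caller decides what an empty list of keys should mean.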

elashrry · 2023-09-09T13:51:07Z

Thank you, @rhshadrach, for the feedback and the solution to the index problem.

Could you please explain the rationale behind not accepting grouping by empty lists?

I will fix the index issue, just in case, but feel free to close this PR if you think it doesn't align with the library's views, and the issue too.

Also, I would appreciate if you can give me some guidelines if I wanted to work on improving named aggregation on DataFrames.


rhshadrach · 2023-09-09T14:29:37Z

Could you please explain the rationale behind not accepting grouping by empty lists?

It introduces a corner case across all groupby ops.
.groupby([]) is an odd request from a user, and I think could be encountered when there is a bug in user code.
We are introducing a 10-100x slower way to accomplish the same result (admittedly with reshaping by the user required at the end).
It is easy to accomplish in user code today (shown in ENH: Allow empty groupby #55068 (comment)).
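To make the "reshaping by the user" remark concrete, here is a small sketch that aggregates directly on the DataFrame and then reshapes the result into the one-row frame a constant-key groupby would give. The column names and the renaming step are illustrative assumptions, and the sketch makes no attempt to measure the speed difference mentioned above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": np.arange(3), "col2": np.arange(3, 6)})

# Direct aggregation: returns a Series indexed by column name.
direct = df.agg({"col1": "min", "col2": "sum"})

# Reshape the Series into a one-row frame and rename the columns
# to match the named-aggregation output below.
direct_frame = direct.to_frame().T.rename(
    columns={"col1": "col1_min", "col2": "col2_sum"}
)

# Constant-key groupby, for comparison.
grouped = df.groupby(np.zeros(len(df), dtype=np.int64)).agg(
    col1_min=("col1", "min"),
    col2_sum=("col2", "sum"),
)

print(direct_frame)
print(grouped)
```

Both frames hold the same values under the same column names; the index labels and dtypes may still differ depending on how the reshaping is done.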

rhshadrach · 2023-09-09T14:36:15Z

Also, I would appreciate if you can give me some guidelines if I wanted to work on improving named aggregation on DataFrames.

I don't have any concrete thoughts here, only that the output you showed looked inconvenient to me (namely, coercing to float because of what I think are unnecessary NaNs).

rhshadrach · 2023-09-09T14:45:41Z

It introduces a corner case across all groupby ops.

Just to expand on this a little more, what is the expected output of this?

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
keys = pd.Categorical([], categories=['x', 'y'])
df.groupby(keys, observed=False).sum()

elashrry · 2023-09-09T15:31:14Z

It introduces a corner case across all groupby ops.

Just to expand on this a little more, what is the expected output of this?
df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
keys = pd.Categorical([], categories=['x', 'y'])
df.groupby(keys, observed=False).sum()

That's a good example. On one hand, I think it should aggregate all the data since it is an empty list. On the other hand, it should also return results for non-observed categories which will make the result of aggregating all the data hard to understand. I think I understand now it is better to have the user handle the empty case before passing a non-empty version of it to pandas, however they see fit for their purpose. Thank you for clarifying it. Please feel free to close the PR.
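For context, the well-defined case can be sketched with a categorical key of the same length as the frame. With observed=False, the declared but unobserved category "y" still gets a row in the result, which is what makes an "aggregate everything" interpretation of an empty categorical hard to reconcile. The key values below are an illustrative assumption, not taken from the thread.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# Every row falls in category "x"; "y" is declared but never observed.
keys = pd.Categorical(["x", "x", "x"], categories=["x", "y"])

# observed=False keeps unobserved categories in the result.
result = df.groupby(keys, observed=False).sum()
print(result)
```

The row for "x" holds the sums over all rows, while the row for "y" holds zeros, since sum over an empty group is 0.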

elashrry added 5 commits September 8, 2023 12:52

create dummy groupings if the user passed an empty list

b3bd468

add unit tests for empty grouping

4a04d2d

delete unit test was there to prevent empty ggrouping

249e3a0

add documentation

d69c440

move documentation to enhancements section

94bd32b

elashrry requested a review from rhshadrach as a code owner September 8, 2023 16:10

elashrry mentioned this pull request Sep 8, 2023

ENH: Allow to group by an empty list #35366

Closed

mroeschke requested changes Sep 8, 2023


use int64 for the dummy groups and modify conditions to work with any iterable and add tests

7c674da

rhshadrach closed this Sep 9, 2023
