Skip to content

ENH: Allow empty groupby #55068

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed

Conversation

elashrry
Copy link

@elashrry elashrry commented Sep 8, 2023

I defined the keys as a dummy list in case the user passes an empty list []. I added tests with empty groupings and tested the code running the command pytest -k "groupby and not arrow. Please, let me know if it works.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with #35366 (comment) as of now and think an "empty groupby" should be already doable by just calling the associated DataFrame/Series method

@elashrry
Copy link
Author

elashrry commented Sep 9, 2023

There are probably two arguments against using the aggregation functions of dataframes and series:

  1. Their output is different from the output of aggregation after groupby.
>>> df = DataFrame(
    {
       "group": ["A", "B", "A"],
        "col1": np.arange(3),
        "col2": np.arange(3, 6),
        "col3": np.arange(6, 9),
    }
)
>>> df
  group  col1  col2  col3
0     A     0     3     6
1     B     1     4     7
2     A     2     5     8
>>> df.agg(
    col1_min=("col1", "min"),
    col2_min=("col2", "sum"),
)
          col1  col2
col1_min   0.0   NaN
col2_min   NaN  12.0
>>> df.groupby([0, 0, 0]).agg(
    col1_min=("col1", "min"),
    col2_min=("col2", "sum"),
)
   col1_min  col2_min
0         0        12
  1. Even if it was the same output as aggregation after using groupby, it will still be a nice feature to have to allow uniform behaviour in cases of automation, knowing that group lists could be empty, instead of working around it with if statements.

Your decision could close this PR and the mentioned issue. We can:

  1. Keep things as they are.
  2. Change the behaviour of agg function of dataframes and series to match those of aggregation after using groupby and, maybe or maybe not, make groupby's aggregation methods fallback to those of dataframes and series in case of empty keys passed by the user.
  3. Accept empty keys for gorupby and keep the dataframes and series aggregation unchanged. (this PR)

In my opinion, I prefer option 2 but I don't know the reason behind choosing this particular shape of the output of dataframes aggregations. If you want to keep it unchanged or you think it would take time to change, then I am strongly for option 3, having the option of passing empty keys in groupby would be a nice feature to have.

@elashrry
Copy link
Author

elashrry commented Sep 9, 2023

I just realised that the "empty list" provided by the user could be of different types (list, tuples, numpy arrays, ..etc) so I can create a function, say if_empty(keys) to include the condition to check the user passed empty keys and account for multiple possible data types instead of the condition (isinstance(keys, list) and keys == []) that only works with lists.

There is also the problem with having the dummy list as zeros integers, they are treated as int64 on macOS and as int32 on Windows. I don't know what the best way to handle that.

Please let me know your thoughts, so I can continue or stop working on the PR.

@rhshadrach
Copy link
Member

rhshadrach commented Sep 9, 2023

I also agree on not supporting grouping by empty lists. If users really want to do this, they can:

if len(keys) == 0:
    keys = np.zeros(len(df))
result = df.grouby(keys)...

It does look like there may be some opportunities for improving named aggregation on DataFrames.

There is also the problem with having the dummy list as zeros integers, they are treated as int64 on macOS and as int32 on Windows. I don't know what the best way to handle that.

This will be fixed in NumPy 2.0. For now, you could use np.int64(0) if you want consistent integers.

@elashrry
Copy link
Author

elashrry commented Sep 9, 2023

Thank you, @rhshadrach for the feedback and the solution for the index problem.

Could you please explain the rationale behind not accepting grouping by empty lists?

I will fix the index issue, just in case, but feel free to close this PR if you think it doesn't align with the library's views, and the issue too.

Also, I would appreciate if you can give me some guidelines if I wanted to work on improving named aggregation on DataFrames.

@rhshadrach
Copy link
Member

rhshadrach commented Sep 9, 2023

Could you please explain the rationale behind not accepting grouping by empty lists?

  1. It introduces a corner case across all groupby ops.
  2. .groupby([]) is an odd request from a user, and I think could be encountered when there is a bug in user code.
  3. We are introducing a 10-100x slower way to accomplish the same result (admittedly with reshaping by the user required at the end).
  4. It is easy to accomplish in user code today (shown in ENH: Allow empty groupby #55068 (comment)).

@rhshadrach
Copy link
Member

Also, I would appreciate if you can give me some guidelines if I wanted to work on improving named aggregation on DataFrames.

I don't have any concrete thoughts here, only that output you showed looked inconvenient to me (namely, coercing to float because of what I think are unnecessary NaNs).

@rhshadrach
Copy link
Member

  1. It introduces a corner case across all groupby ops.

Just to expand on this a little more, what is the expected output of this?

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
keys = pd.Categorical([], categories=['x', 'y'])
df.groupby(keys, observed=False).sum()

@elashrry
Copy link
Author

elashrry commented Sep 9, 2023

  1. It introduces a corner case across all groupby ops.

Just to expand on this a little more, what is the expected output of this?

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
keys = pd.Categorical([], categories=['x', 'y'])
df.groupby(keys, observed=False).sum()

That's a good example. On one hand, I think it should aggregate all the data since it is an empty list. On the other hand, it should also return results for non-observed categories which will make the result of aggregating all the data hard to understand. I think I understand now it is better to have the user handle the empty case before passing a non-empty version of it to pandas, however they see fit for their purpose. Thank you for clarifying it. Please feel free to close the PR.

@rhshadrach rhshadrach closed this Sep 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Allow to group by an empty list
3 participants