ENH: Allow empty groupby #55068
I think I agree with #35366 (comment) as of now and think an "empty groupby" should be already doable by just calling the associated DataFrame/Series method
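For illustration, here is a minimal sketch of what "just calling the associated DataFrame method" looks like next to grouping on a constant dummy key. The frame and column names are made up for this example; they are not taken from the PR. Note that the two results differ in shape, which is the point raised later in the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not from the PR.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Whole-frame aggregation: returns a Series indexed by column name.
whole = df.sum()

# Grouping on a constant dummy key puts every row in one group
# and returns a one-row DataFrame instead.
dummy_key = np.zeros(len(df), dtype=int)
grouped = df.groupby(dummy_key).sum()

print(whole)
print(grouped)
```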
There are probably two arguments against using the aggregation functions of dataframes and series:
Your decision could close this PR and the mentioned issue. We can:
I prefer option 2, but I don't know the reasoning behind this particular output shape for DataFrame aggregations. If you want to keep it unchanged, or you think it would take time to change, then I am strongly for option 3: having the option of passing empty keys in
I just realised that the "empty list" provided by the user could be of different types (list, tuple, NumPy array, etc.), so I can create a function, say There is also the problem with having the dummy list as zeros integers, they are treated as Please let me know your thoughts, so I can continue or stop working on the PR.
I also agree on not supporting grouping by empty lists. If users really want to do this, they can:
It does look like there may be some opportunities for improving named aggregation on DataFrames.
This will be fixed in NumPy 2.0. For now, you could use
Thank you, @rhshadrach, for the feedback and the solution to the index problem. Could you please explain the rationale behind not accepting grouping by empty lists? I will fix the index issue just in case, but feel free to close this PR, and the issue too, if you think it doesn't align with the library's views. I would also appreciate some guidelines if I wanted to work on improving named aggregation on DataFrames.
… iterable and add tests
I don't have any concrete thoughts here, only that the output you showed looked inconvenient to me (namely, coercing to float because of what I think are unnecessary NaNs).
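To make the NaN point concrete, here is a small sketch of named aggregation on a DataFrame. It is an assumption that this is the kind of output being discussed, since the original snippet is not shown; the data and the labels a_min and b_max are hypothetical.

```python
import pandas as pd

# Hypothetical integer data, not from the PR.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Named aggregation on a DataFrame: one row per output name, one
# column per source column. Cells where a name does not apply to a
# column are filled with NaN, so integer results no longer display
# as integers.
named = df.agg(a_min=("a", "min"), b_max=("b", "max"))
print(named)

# A flat alternative built by hand keeps the integer results.
flat = pd.Series({"a_min": df["a"].min(), "b_max": df["b"].max()})
print(flat)
```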
Just to expand on this a little more, what is the expected output of this?
That's a good example. On one hand, I think it should aggregate all the data, since the list is empty. On the other hand, it should also return results for non-observed categories, which would make the result of aggregating all the data hard to understand. I understand now that it is better to have users handle the empty case themselves, however they see fit for their purpose, before passing a non-empty version of it to pandas. Thank you for clarifying it. Please feel free to close the PR.
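A minimal sketch of handling the empty case on the user side, as concluded above. The helper name aggregate_maybe_grouped is hypothetical and not part of pandas; it assumes the keys are column labels and that a whole-frame aggregation is what the caller wants when no keys are given.

```python
import numpy as np
import pandas as pd


def aggregate_maybe_grouped(df, keys, func):
    """Hypothetical user-side helper, not a pandas API.

    With no keys, aggregate the whole frame; otherwise group by the keys.
    len() works for lists, tuples, and NumPy arrays alike.
    """
    if len(keys) == 0:
        return df.agg(func)
    return df.groupby(list(keys)).agg(func)


# Hypothetical data, not from the PR.
df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1, 2, 3]})

# Empty keys of several types all take the whole-frame path.
total_from_list = aggregate_maybe_grouped(df[["v"]], [], "sum")
total_from_tuple = aggregate_maybe_grouped(df[["v"]], (), "sum")
total_from_array = aggregate_maybe_grouped(df[["v"]], np.array([]), "sum")

# Non-empty keys go through the regular groupby path.
per_group = aggregate_maybe_grouped(df, ["g"], "sum")
```

The caller still has to decide what the empty case should return (here a Series rather than a one-row DataFrame), which is the judgment the thread leaves to users.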
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
I defined the keys as a dummy list in case the user passes an empty list []. I added tests with empty groupings and tested the code by running the command pytest -k "groupby and not arrow". Please let me know if it works.