-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: groupby.idxmax/idxmin consistently raise on unobserved categorical #55268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
1325855
e2dd88a
5daaaed
c730ddb
2bcdd81
a97cd21
d62297c
34f4116
2598821
b2a36eb
a1954d6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2015,10 +2015,14 @@ def _transform(self, func, *args, engine=None, engine_kwargs=None, **kwargs): | |
with com.temp_setattr(self, "as_index", True): | ||
# GH#49834 - result needs groups in the index for | ||
# _wrap_transform_fast_result | ||
if engine is not None: | ||
kwargs["engine"] = engine | ||
kwargs["engine_kwargs"] = engine_kwargs | ||
result = getattr(self, func)(*args, **kwargs) | ||
if func in ["idxmin", "idxmax"]: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it not possible to stick with the same engine pattern that is in place for this? At first glance I'm wondering what makes these different from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i think this is the same issue as i asked about somewhere else: _idxmax_idxmin accepts ignore_unobserved while idxmin/idxmax do not There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The main difference in behavior is that min/max do not raise on unobserved categories, while idxmin and idxmax do. Example - agg
However, the fact that we don't do something special for min/max means that transform unnecessarily coerces to float: Example - transform
I consider this a bug in min/max with transform. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've opened #55326 to track There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah OK thanks - that is helpful context. So do you see the solution for min/max being the same as what you have for idxmin/idxmax here? I think the broader issue is that we've wanted over time to move away from branching for function specializations within groupby. If that still holds true then I wonder what prevents us from sticking with the existing kwargs interface to solve both this PR and eventually solve min/max's issue There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd like to explore different solutions for min/max and other aggregations, but I don't know what that could be at this time. I don't understand what you're suggesting with kwargs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry I thought the engine / engine_kwargs were specialized arguments for each function implementation, but I see now those are meant for numba. The numba functions are UDFs right? I'm assuming from the branch here that we would never want to pass numba arguments to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Correct - we never get here when using numba. |
||
func = cast(Literal["idxmin", "idxmax"], func) | ||
result = self._idxmax_idxmin(func, True, *args, **kwargs) | ||
else: | ||
if engine is not None: | ||
kwargs["engine"] = engine | ||
kwargs["engine_kwargs"] = engine_kwargs | ||
result = getattr(self, func)(*args, **kwargs) | ||
|
||
return self._wrap_transform_fast_result(result) | ||
|
||
|
@@ -5720,6 +5724,113 @@ def sample( | |
sampled_indices = np.concatenate(sampled_indices) | ||
return self._selected_obj.take(sampled_indices, axis=self.axis) | ||
|
||
def _idxmax_idxmin( | ||
self, | ||
how: Literal["idxmax", "idxmin"], | ||
ignore_unobserved: bool = False, | ||
axis: Axis | None | lib.NoDefault = lib.no_default, | ||
skipna: bool = True, | ||
numeric_only: bool = False, | ||
): | ||
"""Compute idxmax/idxmin. | ||
|
||
Parameters | ||
---------- | ||
how: {"idxmin", "idxmax"} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nitpick i think missing space between "how" and colon There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also usually but not always we have single-quotes inside these (no idea why and i genuinely dont care, extra since this is private) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks - fixed in #54234 |
||
Whether to compute idxmin or idxmax. | ||
axis : {{0 or 'index', 1 or 'columns'}}, default None | ||
The axis to use. 0 or 'index' for row-wise, 1 or 'columns' for column-wise. | ||
If axis is not provided, grouper's axis is used. | ||
numeric_only : bool, default False | ||
Include only float, int, boolean columns. | ||
skipna : bool, default True | ||
Exclude NA/null values. If an entire row/column is NA, the result | ||
will be NA. | ||
ignore_unobserved : bool, default False | ||
When True and an unobserved group is encountered, do not raise. This used | ||
for transform where unobserved groups do not play an impact on the result. | ||
|
||
Returns | ||
------- | ||
Series or DataFrame | ||
idxmax or idxmin for the groupby operation. | ||
""" | ||
if axis is not lib.no_default: | ||
if axis is None: | ||
axis = self.axis | ||
axis = self.obj._get_axis_number(axis) | ||
self._deprecate_axis(axis, how) | ||
else: | ||
axis = self.axis | ||
|
||
if not self.observed and any( | ||
ping._passed_categorical for ping in self.grouper.groupings | ||
): | ||
expected_len = np.prod( | ||
[len(ping.group_index) for ping in self.grouper.groupings] | ||
) | ||
if len(self.grouper.groupings) == 1: | ||
result_len = len(self.grouper.groupings[0].grouping_vector.unique()) | ||
else: | ||
# result_index only contains observed groups in this case | ||
result_len = len(self.grouper.result_index) | ||
assert result_len <= expected_len | ||
has_unobserved = result_len < expected_len | ||
|
||
raise_err: bool | np.bool_ = not ignore_unobserved and has_unobserved | ||
# Only raise an error if there are columns to compute; otherwise we return | ||
# an empty DataFrame with an index (possibly including unobserved) but no | ||
# columns | ||
data = self._obj_with_exclusions | ||
if raise_err and isinstance(data, DataFrame): | ||
if numeric_only: | ||
data = data._get_numeric_data() | ||
raise_err = len(data.columns) > 0 | ||
else: | ||
raise_err = False | ||
if raise_err: | ||
raise ValueError( | ||
f"Can't get {how} of an empty group due to unobserved categories. " | ||
"Specify observed=True in groupby instead." | ||
) | ||
|
||
try: | ||
if self.obj.ndim == 1: | ||
result = self._op_via_apply(how, skipna=skipna) | ||
else: | ||
|
||
def func(df): | ||
method = getattr(df, how) | ||
return method(axis=axis, skipna=skipna, numeric_only=numeric_only) | ||
|
||
func.__name__ = how | ||
result = self._python_apply_general( | ||
func, self._obj_with_exclusions, not_indexed_same=True | ||
) | ||
except ValueError as err: | ||
name = "argmax" if how == "idxmax" else "argmin" | ||
if f"attempt to get {name} of an empty sequence" in str(err): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. would this be simpler if we just said changed "arg" to "idx" in the cython method with a comment as to why we are using an apparently-wrong message? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This arises from a call to NumPy's argmin in nanops.nanargmin:
In #54234, this remains only for the |
||
raise ValueError( | ||
f"Can't get {how} of an empty group due to unobserved categories. " | ||
"Specify observed=True in groupby instead." | ||
) from None | ||
raise | ||
|
||
result = result.astype(self.obj.index.dtype) if result.empty else result | ||
|
||
if not skipna: | ||
has_na_value = result.isnull().any(axis=None) | ||
if has_na_value: | ||
warnings.warn( | ||
f"The behavior of {type(self).__name__}.{how} with all-NA " | ||
"values, or any-NA and skipna=False, is deprecated. In a future " | ||
"version this will raise ValueError", | ||
FutureWarning, | ||
stacklevel=find_stack_level(), | ||
) | ||
|
||
return result | ||
|
||
|
||
@doc(GroupBy) | ||
def get_groupby( | ||
|
Uh oh!
There was an error while loading. Please reload this page.