Improve support for pandas Extension Arrays (#10301) #10380
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
-def preprocess_types(t):
-    if isinstance(t, str | bytes):
-        return type(t)
-    elif isinstance(dtype := getattr(t, "dtype", t), np.dtype) and (
-        np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.bytes_)
-    ):
+def maybe_promote_to_variable_width(
+    array_or_dtype: np.typing.ArrayLike | np.typing.DTypeLike,
+) -> np.typing.ArrayLike | np.typing.DTypeLike:
+    if isinstance(array_or_dtype, str | bytes):
+        return type(array_or_dtype)
+    elif isinstance(
+        dtype := getattr(array_or_dtype, "dtype", array_or_dtype), np.dtype
+    ) and (np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.bytes_)):
         # drop the length from numpy's fixed-width string dtypes, it is better to
         # recalculate
         # TODO(keewis): remove once the minimum version of `numpy.result_type` does this
         # for us
         return dtype.type
     else:
-        return t
+        return array_or_dtype
This diff looks ugly, but it's simply renaming the fn + its argument.
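For readers skimming the rename, here is a minimal standalone sketch of what the helper does, under the assumption that it behaves exactly as the hunk above reads. The body is copied from the diff with only the type annotations dropped and `str | bytes` spelled as a tuple; the sample inputs are made up for illustration.

```python
import numpy as np


def maybe_promote_to_variable_width(array_or_dtype):
    # Python str/bytes scalars map to their builtin type.
    if isinstance(array_or_dtype, (str, bytes)):
        return type(array_or_dtype)
    # Arrays expose .dtype; dtypes and everything else are inspected as-is.
    dtype = getattr(array_or_dtype, "dtype", array_or_dtype)
    if isinstance(dtype, np.dtype) and (
        np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.bytes_)
    ):
        # Drop the fixed width (e.g. <U2 becomes np.str_) so that the width
        # is recalculated later instead of being carried along.
        return dtype.type
    return array_or_dtype


short_strings = np.array(["ab"])   # fixed-width unicode dtype
floats = np.array([1.0, 2.0])      # non-string input passes through untouched

promoted = maybe_promote_to_variable_width(short_strings)
untouched = maybe_promote_to_variable_width(floats)
```

Only string-like inputs are rewritten; any other array, dtype, or scalar is returned unchanged.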
+    except TypeError:
+        # passing individual objects to xp.result_type means NEP-18 implementations won't have
+        # a chance to intercept special values (such as NA) that numpy core cannot handle
+        pass
Only this except block is new. The rest of this fn was lifted as-is from result_type() below.
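To make the motivation for the new except block concrete, here is a rough sketch of the failure mode it guards against. It assumes the surrounding try body hands individual objects to `result_type`, as the code comment describes. `_NAType`, `NA`, and `scalar_dtype_or_none` are hypothetical names used only here; `NA` is a stand-in for a sentinel such as `pandas.NA`, so that the snippet needs nothing beyond numpy.

```python
import numpy as np


class _NAType:
    """Hypothetical missing-value sentinel (stand-in for something like pandas.NA)."""


NA = _NAType()


def scalar_dtype_or_none(value):
    # Hypothetical helper: ask numpy for the dtype of a single object and
    # swallow the TypeError numpy raises for objects it cannot interpret.
    try:
        return np.result_type(value)
    except TypeError:
        # numpy core has no idea what NA is; skip it rather than fail
        return None


float_dtype = scalar_dtype_or_none(1.5)
na_dtype = scalar_dtype_or_none(NA)
```

Without the except clause, the second call would propagate the TypeError instead of letting the caller fall back to another code path.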
-    if any(is_extension_array_dtype(x) for x in scalars_or_arrays):
-        extension_array_types = [
-            x.dtype for x in scalars_or_arrays if is_extension_array_dtype(x)
-        ]
-        if len(extension_array_types) == len(scalars_or_arrays) and all(
-            isinstance(x, type(extension_array_types[0])) for x in extension_array_types
-        ):
-            return scalars_or_arrays
-        raise ValueError(
-            "Cannot cast arrays to shared type, found"
-            f" array types {[x.dtype for x in scalars_or_arrays]}"
-        )
-
Because we now provide an array-api implementation of np.result_type, we no longer need these special cases (which were far too special, IMO; the cases where we raised ValueError are perfectly valid).
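As an illustration of why the removed branch was too strict, the sketch below copies it into a standalone function and feeds it two pandas extension arrays of different types. `old_special_case` is a hypothetical name, and the trailing `return` is a placeholder for whatever the real function did next, which this thread does not show.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def old_special_case(scalars_or_arrays):
    # Standalone copy of the removed branch, for illustration only.
    if any(is_extension_array_dtype(x) for x in scalars_or_arrays):
        extension_array_types = [
            x.dtype for x in scalars_or_arrays if is_extension_array_dtype(x)
        ]
        if len(extension_array_types) == len(scalars_or_arrays) and all(
            isinstance(x, type(extension_array_types[0]))
            for x in extension_array_types
        ):
            return scalars_or_arrays
        raise ValueError(
            "Cannot cast arrays to shared type, found"
            f" array types {[x.dtype for x in scalars_or_arrays]}"
        )
    # Assumption: placeholder for the rest of the original function.
    return scalars_or_arrays


ints = pd.array([1, 2], dtype="Int64")
floats = pd.array([1.5], dtype="Float64")

same_type = old_special_case([ints, ints])  # accepted: identical EA dtype classes

try:
    old_special_case([ints, floats])  # rejected, although Int64 + Float64 is sensible
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

Mixing nullable integers with nullable floats is an ordinary promotion, yet the old branch rejected it outright.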
Thanks for taking this on, it's great work!
Can you extract out any non-EA changes to duck_array_ops.py and dtypes.py and make a separate PR please? It will be far easier to review then.
The core ideas here are:
Provide an array-api implementation of np.result_type(*arrays_or_dtypes). This unlocks arbitrary N-ary operations on ExtensionArrays without loss of type info (as found in pre-2024 releases) or blowing up due to lack of EA-specific implementations (as documented in "Regression in DataArrays created from Pandas pd.Series", #10301).
Minor refactors & bugfixes are documented inline.
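For anyone unfamiliar with the mechanism, the following is a toy sketch of how a duck array can answer `np.result_type` itself through NEP 18's `__array_function__` hook. It is not the implementation in this PR: `TaggedArray`, its `dtype_name` attribute, and the "fall back to object" rule are all assumptions made up for the example.

```python
import numpy as np


class TaggedArray:
    """Hypothetical duck array that resolves np.result_type on its own (NEP 18)."""

    def __init__(self, values, dtype_name):
        self.values = list(values)
        self.dtype_name = dtype_name

    def __array_function__(self, func, types, args, kwargs):
        if func is np.result_type:
            # N-ary by construction: inspect every argument numpy passed in.
            if all(isinstance(a, TaggedArray) for a in args):
                names = {a.dtype_name for a in args}
                if len(names) == 1:
                    return names.pop()
            # Assumed fallback rule for mixed inputs, purely illustrative.
            return "object"
        return NotImplemented


a = TaggedArray([1, 2], "Int64")
b = TaggedArray([3], "Int64")
c = TaggedArray([1.5], "Float64")

same = np.result_type(a, b, a)   # three arguments, one shared type
mixed = np.result_type(a, c)     # differing types hit the fallback
```

Because numpy forwards the whole argument tuple to the hook, the array type decides the result for any number of operands, with no per-pair special cases.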