-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Refactor string methods for StringArray + return IntegerArray for numeric results #29640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
14ca804
07fc53d
ddef378
a147081
36b1bf7
d01ddfe
22a2c79
686de87
0ea2d6b
358ce59
7b25d8d
46dfc20
1b96e53
3b20ffc
e4bae5e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,10 +15,14 @@ | |
ensure_object, | ||
is_bool_dtype, | ||
is_categorical_dtype, | ||
is_extension_array_dtype, | ||
is_integer, | ||
is_integer_dtype, | ||
is_list_like, | ||
is_object_dtype, | ||
is_re, | ||
is_scalar, | ||
is_string_dtype, | ||
) | ||
from pandas.core.dtypes.generic import ( | ||
ABCDataFrame, | ||
|
@@ -31,6 +35,7 @@ | |
from pandas.core.algorithms import take_1d | ||
from pandas.core.base import NoNewAttributesMixin | ||
import pandas.core.common as com | ||
from pandas.core.construction import extract_array | ||
|
||
_cpython_optimized_encoders = ( | ||
"utf-8", | ||
|
@@ -109,9 +114,46 @@ def cat_safe(list_of_columns: List, sep: str): | |
|
||
def _na_map(f, arr, na_result=np.nan, dtype=object): | ||
# should really _check_ for NA | ||
if is_extension_array_dtype(arr.dtype): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you sync the signatures up with _map, also rename _map -> _map_arraylike (or similar) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, don't follow this. I think they do match, aside from @jbrockmendel's request to call it I guess There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am really -1 on 2 different branches here. If they have exactly the same signature a little less negative. again I would rename _map to be more informative here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What do you mean with "-1 on 2 different branches here" ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's why my initial PR was so much more complex, since it tried to handle both cases similarly. I think that was more complex than this. As Joris says, the main point of divergence is that for StringArray we usually know the result dtype exactly. It doesn't depend on the presence of NAs. Additionally,
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok, at the very least these signatures for what you call _map and _stringarray_map should be exactly the same. I think this is crucial for not adding technical debt. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback is your suggestion to add There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
What's the technical debt we're adding here? By definition, we need to handle StringArray differently, since its result type doesn't depend on the presence of NAs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
And there is also a good reason for that, as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you create an issue to clean this up (_map and _na_map), and/or refactor of this, post this PR.
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return _map_ea(f, arr, na_value=na_result, dtype=dtype) | ||
return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype) | ||
|
||
|
||
def _map_ea(f, arr, na_value, dtype): | ||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
from pandas.arrays import IntegerArray, StringArray | ||
|
||
arr = extract_array(arr, extract_numpy=True) | ||
mask = isna(arr) | ||
|
||
assert isinstance(arr, StringArray) | ||
arr = arr._ndarray | ||
jorisvandenbossche marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
if is_integer_dtype(dtype): | ||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
na_value_is_na = isna(na_value) | ||
if na_value_is_na: | ||
na_value = 1 | ||
result = lib.map_infer_mask( | ||
arr, | ||
f, | ||
mask.view("uint8"), | ||
convert=False, | ||
na_value=na_value, | ||
dtype=np.dtype("int64"), | ||
) | ||
|
||
if not na_value_is_na: | ||
mask[:] = False | ||
|
||
return IntegerArray(result, mask) | ||
|
||
elif is_string_dtype(dtype) and not is_object_dtype(dtype): | ||
result = lib.map_infer_mask(arr, f, mask.view("uint8"), na_value=na_value) | ||
return StringArray(result) | ||
# TODO: BooleanArray | ||
else: | ||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return lib.map_infer_mask(arr, f, mask.view("uint8")) | ||
|
||
|
||
def _map(f, arr, na_mask=False, na_value=np.nan, dtype=object): | ||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if not len(arr): | ||
return np.ndarray(0, dtype=dtype) | ||
|
Uh oh!
There was an error while loading. Please reload this page.