Closed
Labels
Arrow (pyarrow functionality), Bug, Reduction Operations (sum, mean, min, max, etc.), Strings (String extension data type and string data)
Description
Motivation
In order for Dask to perform large shuffles (set_index, join on a non-index column, ...) on a column it needs to be able to compute quantiles.
To do this it is useful to compute min/max values.
What actually breaks
When I try to compute the min on a column of type string[pyarrow], I get the following exception:
import pandas as pd
s = pd.Series(["a", "b", "c"]).astype("string[pyarrow]")
s.min()
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
10825 )
10826 def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10827 return NDFrame.min(self, axis, skipna, level, numeric_only, **kwargs)
10828
10829 setattr(cls, "min", min)
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
10348
10349 def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10350 return self._stat_function(
10351 "min", nanops.nanmin, axis, skipna, level, numeric_only, **kwargs
10352 )
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
10343 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
10344 )
> 10345 return self._reduce(
10346 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
10347 )
~/miniconda/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4380 if isinstance(delegate, ExtensionArray):
4381 # dispatch to ExtensionArray interface
-> 4382 return delegate._reduce(name, skipna=skipna, **kwds)
4383
4384 else:
~/miniconda/lib/python3.8/site-packages/pandas/core/arrays/string_arrow.py in _reduce(self, name, skipna, **kwargs)
377 def _reduce(self, name: str, skipna: bool = True, **kwargs):
378 if name in ["min", "max"]:
--> 379 return getattr(self, name)(skipna=skipna)
380
381 raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
AttributeError: 'ArrowStringArray' object has no attribute 'min'
Solution
I am hopeful that Arrow already has a min/max implementation and that it just hasn't been hooked up to ArrowStringArray yet.