GroupBy aggregation fails if DataFrame has CategoricalIndex #31223

Closed
frances-h opened this issue Jan 22, 2020 · 2 comments · Fixed by #31238
Labels: ExtensionArray (Extending pandas with custom dtypes or arrays), Groupby, Regression (Functionality that used to work in a prior pandas version)
Milestone: 1.0.0

Comments

@frances-h

Code Sample, a copy-pastable example if possible

import pandas as pd

ids = pd.Categorical([0, 1, 2])
df = pd.DataFrame({
    'id': ids,
    'groups': [1, 1, 2],
    'value': [0, 1, 0]
}).set_index('id')

df.groupby('groups').agg({'value': pd.Series.nunique})

Problem description

The above works in v0.25.3 but fails in 1.0.0rc0 with: TypeError: Cannot convert Categorical to numpy.ndarray

Full stack trace:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 940, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/base.py", line 430, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/base.py", line 397, in _agg
    result[fname] = func(fname, agg_how)
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/base.py", line 381, in _agg_1dim
    return colg.aggregate(how)
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 265, in aggregate
    return self._python_agg_general(func, *args, **kwargs)
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 935, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/groupby/ops.py", line 624, in agg_series
    return self._aggregate_series_fast(obj, func)
  File "~/.virtualenvs/3.6-test-pd-1.0.0/lib/python3.6/site-packages/pandas/core/groupby/ops.py", line 648, in _aggregate_series_fast
    grouper = libreduction.SeriesGrouper(obj, func, group_index, ngroups, dummy)
  File "pandas/_libs/reduction.pyx", line 329, in pandas._libs.reduction.SeriesGrouper.__init__
TypeError: Cannot convert Categorical to numpy.ndarray

Expected Output

The correct result of applying the aggregation function to the DataFrameGroupBy, as in v0.25.3.
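Until the fix lands, one possible workaround (a sketch; it assumes the categorical index isn't needed during the aggregation itself) is to move the index back into a regular column before grouping, so the cython fast path never sees the extension array:

```python
import pandas as pd

ids = pd.Categorical([0, 1, 2])
df = pd.DataFrame({
    'id': ids,
    'groups': [1, 1, 2],
    'value': [0, 1, 0]
}).set_index('id')

# Workaround: reset_index() turns the CategoricalIndex into an
# ordinary column, leaving a plain RangeIndex for the groupby machinery.
result = df.reset_index().groupby('groups').agg({'value': pd.Series.nunique})
# group 1 has values [0, 1] -> nunique 2; group 2 has [0] -> nunique 1
```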

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.1
setuptools : 45.1.0
Cython : None
pytest : 5.2.0
hypothesis : None
sphinx : 2.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.2.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : 0.3.2
gcsfs : None
lxml.etree : None
matplotlib : 3.0.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.0
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.47.0

@jschendel
Member

Thanks, I can confirm that this is broken on master and was working on 0.25.3.

@jschendel jschendel added Categorical Categorical Data Type Groupby Regression Functionality that used to work in a prior pandas version labels Jan 23, 2020
@jschendel
Member

Looks like this is a little more generic than Categorical and occurs for any index that's backed by an extension array, e.g. ids = pd.interval_range(0, 3) causes the same error.
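For reference, a minimal sketch of the IntervalIndex variant (same data as the original report; on a build with the fix it aggregates normally):

```python
import pandas as pd

# Same frame as the Categorical repro, but indexed by an IntervalIndex,
# which is also backed by an extension array (IntervalArray).
ids = pd.interval_range(0, 3)
df = pd.DataFrame({'groups': [1, 1, 2], 'value': [0, 1, 0]}, index=ids)

# On 1.0.0rc0 this hit the same TypeError from SeriesGrouper.__init__;
# once fixed it returns the per-group nunique of 'value'.
result = df.groupby('groups').agg({'value': pd.Series.nunique})
```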

The following changes appear to fix the issue, but I haven't checked whether they break anything else:

diff --git a/pandas/_libs/reduction.pyx b/pandas/_libs/reduction.pyx
index 8571761f7..635c0e36d 100644
--- a/pandas/_libs/reduction.pyx
+++ b/pandas/_libs/reduction.pyx
@@ -158,7 +158,7 @@ cdef class _BaseGrouper:
         if util.is_array(values) and not values.flags.contiguous:
             # e.g. Categorical has no `flags` attribute
             values = values.copy()
-        index = dummy.index.values
+        index = dummy.index.to_numpy()
         if not index.flags.contiguous:
             index = index.copy()
 
@@ -229,7 +229,7 @@ cdef class SeriesBinGrouper(_BaseGrouper):
         self.arr = values
         self.typ = series._constructor
         self.ityp = series.index._constructor
-        self.index = series.index.values
+        self.index = series.index.to_numpy()
         self.name = series.name
 
         self.dummy_arr, self.dummy_index = self._check_dummy(dummy)
@@ -326,7 +326,7 @@ cdef class SeriesGrouper(_BaseGrouper):
         self.arr = values
         self.typ = series._constructor
         self.ityp = series.index._constructor
-        self.index = series.index.values
+        self.index = series.index.to_numpy()
         self.name = series.name
 
         self.dummy_arr, self.dummy_index = self._check_dummy(dummy)

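For context, a small sketch of why the change helps: for an extension-array-backed index, Index.values returns the extension array itself, which lacks the ndarray .flags attribute the cython code checks, while Index.to_numpy() always materializes a plain ndarray:

```python
import numpy as np
import pandas as pd

idx = pd.CategoricalIndex([0, 1, 2])

# .values preserves the Categorical (an ExtensionArray); it has no
# ndarray-style .flags attribute, which is what the cython path trips on.
print(type(idx.values))

# .to_numpy() materializes a plain ndarray the fast path can consume.
print(type(idx.to_numpy()))
```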
@jschendel jschendel added this to the 1.0.0 milestone Jan 23, 2020
@jschendel jschendel added ExtensionArray Extending pandas with custom dtypes or arrays. and removed Categorical Categorical Data Type labels Jan 23, 2020