
groupby() drops categorical columns when aggregating with isna() #29837


Closed
xujiboy opened this issue Nov 25, 2019 · 7 comments · Fixed by #35039
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@xujiboy

xujiboy commented Nov 25, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [1, 2, 1, 2],
                   'numerical_col': [.1, .2, np.nan, .3],
                   'object_col': ['foo', 'bar', 'foo', 'fee'],
                   'categorical_col': ['foo', 'bar', 'foo', 'fee']
                  })

df = df.astype({'categorical_col': 'category'})

# Count missing values per group and column
df.groupby(['A', 'B']).agg(lambda col: col.isna().sum())

#      numerical_col  object_col
# A B
# 1 1            1.0           0
#   2            0.0           0

Problem description

The categorical column "categorical_col" is expected to survive the aggregation, but it is dropped from the result.

Expected Output

#      numerical_col  object_col  categorical_col
# A B
# 1 1            1.0           0                0
#   2            0.0           0                0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.11.6.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.3.1
pip: 19.3.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml: 4.3.0
bs4: None
html5lib: None
sqlalchemy: 1.2.13
pymysql: 0.9.3
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Contributor

jreback commented Nov 25, 2019

you need to try with a much newer version and/or master

@xujiboy
Author

xujiboy commented Nov 25, 2019

Thanks for the suggestion, @jreback. However, I observed the same behavior with pandas==0.25.1.

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.184-linuxkit
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.1
setuptools : 41.0.1
Cython : 0.29.13
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.8
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

@jbrockmendel jbrockmendel added Apply Apply, Aggregate, Transform, Map Categorical Categorical Data Type Groupby labels Nov 30, 2019
@AskaryanKarine

AskaryanKarine commented Dec 2, 2019

Do you actually need the categorical dtype at the point where you call groupby()? Have you considered this workaround?

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [1, 2, 1, 2],
                   'numerical_col': [.1, .2, np.nan, .3],
                   'object_col': ['foo', 'bar', 'foo', 'fee'],
                   'categorical_col': ['foo', 'bar', 'foo', 'fee']
                  })
# Keep a reference to the frame as it was before the categorical conversion
df_double = df
df = df.astype({'categorical_col': 'category'})
# Aggregate the unconverted frame, grouping by the key columns
df_double.groupby([df['A'], df['B']]).agg(lambda col: col.isna().sum())
df_double = None
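
Another way to get the per-group missing-value counts without keeping a second frame around is sketched below. It is not taken from this thread and does not address the underlying bug: it builds the boolean isna() mask first, so the aggregation only ever sees boolean columns. The variable names na_mask and na_counts are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [1, 2, 1, 2],
                   'numerical_col': [.1, .2, np.nan, .3],
                   'object_col': ['foo', 'bar', 'foo', 'fee'],
                   'categorical_col': ['foo', 'bar', 'foo', 'fee']})
df = df.astype({'categorical_col': 'category'})

# Boolean mask of missing values for the non-key columns only
na_mask = df.drop(columns=['A', 'B']).isna()

# Group the mask by the key columns of the original frame and sum the
# True values; the categorical dtype is never passed to the aggregation
na_counts = na_mask.groupby([df['A'], df['B']]).sum()
print(na_counts)
```

Because the mask is purely boolean, this sidesteps any dtype-specific handling in agg(), at the cost of only working for aggregations that can be expressed on the mask.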

@xujiboy
Author

xujiboy commented Dec 3, 2019

@AskaryanKarine Thanks for the suggestion. I am not looking for a workaround; what is needed is consistent, expected behavior.

@biddwan09
Contributor

Hi, I am really interested in contributing to pandas and would love to work on this issue. Is it already resolved?

@mroeschke
Member

Looks like this works on master. It could use a test.

In [188]: df = pd.DataFrame({'A': [1, 1, 1, 1],
     ...:                    'B': [1, 2, 1, 2],
     ...:                    'numerical_col': [.1, .2, np.nan, .3],
     ...:                    'object_col': ['foo','bar','foo','fee'],
     ...:                    'categorical_col': ['foo','bar','foo','fee']
     ...:                   })
     ...:
     ...: df = df.astype({'categorical_col':'category'})
     ...:
     ...: df.groupby(['A','B']).agg(lambda df: df.isna().sum())
Out[188]:
     numerical_col  object_col  categorical_col
A B
1 1            1.0           0                0
  2            0.0           0                0

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Apply Apply, Aggregate, Transform, Map Categorical Categorical Data Type Groupby labels Jun 28, 2020
@biddwan09
Contributor

Sure, I will add a test case for this in the groupby section.
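
For illustration, a regression test along these lines might look like the sketch below. The function name is hypothetical, and this is not necessarily the test that was merged in #35039. It assumes a pandas version in which the behavior reported as working on master is present, and it deliberately checks only column names and values, not dtypes.

```python
import numpy as np
import pandas as pd


def test_groupby_agg_isna_keeps_categorical_column():
    # Hypothetical test name; data copied from the original report (GH 29837)
    df = pd.DataFrame({'A': [1, 1, 1, 1],
                       'B': [1, 2, 1, 2],
                       'numerical_col': [.1, .2, np.nan, .3],
                       'object_col': ['foo', 'bar', 'foo', 'fee'],
                       'categorical_col': ['foo', 'bar', 'foo', 'fee']})
    df = df.astype({'categorical_col': 'category'})

    result = df.groupby(['A', 'B']).agg(lambda col: col.isna().sum())

    # The categorical column must not be dropped from the result
    assert list(result.columns) == ['numerical_col', 'object_col',
                                    'categorical_col']
    # One missing value in group (1, 1), none anywhere else
    assert result['numerical_col'].tolist() == [1, 0]
    assert result['object_col'].tolist() == [0, 0]
    assert result['categorical_col'].tolist() == [0, 0]
```

Run under pytest, this fails on versions that drop the categorical column (the columns assertion catches it) and passes once the column survives the aggregation.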

7 participants