CategoricalDtype does not work properly with bool column with missing values. #19182

Closed
everdark opened this issue Jan 11, 2018 · 3 comments · Fixed by #29344
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@everdark

import numpy as np
import pandas as pd

df = pd.DataFrame({'b': [True, False, np.nan]})

# ValueError: Buffer dtype mismatch, expected 'Python object' but got 'unsigned long'.
df.b.astype(pd.api.types.CategoricalDtype(categories=[True, False]))

# Same error as above.
df.b.astype('category', categories=[True, False])

# Pass an object type instead. Same error in both methods.
df.b.astype(pd.api.types.CategoricalDtype(categories=pd.Series([True, False]).astype('object').values))
df.b.astype('category', categories=pd.Series([True, False]).astype('object').values)

# No error.
df.b.astype('category')

Problem description

When I try to convert a boolean column with missing values (so the column is actually of dtype object rather than bool) into a categorical variable with a custom category order, both astype('category', categories=...) and pd.api.types.CategoricalDtype(...) fail.

However, if no custom category order is given, astype('category') works without error.
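
The dtype detail mentioned in the description can be checked directly. This is a minimal sketch, assuming a reasonably recent pandas; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# A boolean column without missing values keeps the bool dtype.
complete = pd.Series([True, False, True])

# Adding np.nan makes pandas fall back to the generic object dtype,
# because a NumPy bool array cannot represent a missing value.
with_missing = pd.DataFrame({'b': [True, False, np.nan]}).b

print(complete.dtype)      # bool
print(with_missing.dtype)  # object
```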

Expected outcome

I should be able to run

astype('category', categories=[True, False])

and equivalently

astype(pd.api.types.CategoricalDtype(categories=[True, False]))

without error.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: 3.0.6
pip: 9.0.1
setuptools: 36.2.5
Cython: 0.25.2
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.7 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: 0.1.2
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback added the Dtype Conversions, Categorical, and Difficulty Intermediate labels Jan 11, 2018
@jreback added this to the Next Major Release milestone Jan 11, 2018
@jreback
Contributor

jreback commented Jan 11, 2018

This patch fixes it. It doesn't break anything either, which means this path was not being fully exercised.

Note that this might reduce performance in a couple of cases, so we need to run asv, as well as add some test coverage of the hash bool path. Also, let's update the comment there.

diff --git a/pandas/core/algorithms.py b/pandas/core/algorithms.py
index c754c063f..723f21e70 100644
--- a/pandas/core/algorithms.py
+++ b/pandas/core/algorithms.py
@@ -69,7 +69,7 @@ def _ensure_data(values, dtype=None):
         if is_bool_dtype(values) or is_bool_dtype(dtype):
             # we are actually coercing to uint64
             # until our algos support uint8 directly (see TODO)
-            return np.asarray(values).astype('uint64'), 'bool', 'uint64'
+            return np.asarray(values).astype('object'), 'bool', 'object'
         elif is_signed_integer_dtype(values) or is_signed_integer_dtype(dtype):
             return _ensure_int64(values), 'int64', 'int64'
         elif (is_unsigned_integer_dtype(values) or
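
To illustrate the two casts in the diff, here is a small NumPy-only sketch. It does not call `_ensure_data` or any other pandas internals; it only shows what each `astype` produces for a plain boolean array, on the assumption that this is the kind of input the bool branch receives.

```python
import numpy as np

bool_values = np.asarray([True, False])

# Cast on the removed line: booleans become the unsigned integers 1 and 0.
as_uint64 = bool_values.astype('uint64')

# Cast on the added line: booleans stay booleans, boxed as Python objects,
# which lines up with the object dtype of a bool column that contains NaN.
as_object = bool_values.astype('object')

print(as_uint64.dtype, as_uint64.tolist())
print(as_object.dtype, as_object.tolist())
```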

@jreback jreback changed the title CategoricalDtype does not work properly with bool column with missing values. CategoricalDtype does not work properly with bool column with missing values. Jan 11, 2018
@mroeschke
Member

This appears to work on master. It could use a test.

In [113]: pd.__version__
Out[113]: '0.26.0.dev0+684.g953757a3e'

In [114]: df = pd.DataFrame({'b': [True, False, np.nan]})
     ...:

In [115]: df.b.astype(pd.api.types.CategoricalDtype(categories=[True, False]))
     ...:
Out[115]:
0     True
1    False
2      NaN
Name: b, dtype: category
Categories (2, object): [True, False]
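
A regression test along these lines could cover the case. This is only a sketch: the function name `test_astype_bool_with_nan_to_categorical` is hypothetical, and this is not necessarily the test that was merged in #29344. It assumes a pandas version in which the conversion succeeds, as in the session above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def test_astype_bool_with_nan_to_categorical():
    # GH 19182: bool values plus NaN give an object column.
    ser = pd.Series([True, False, np.nan], name='b')
    assert ser.dtype == object

    dtype = CategoricalDtype(categories=[True, False])
    result = ser.astype(dtype)

    # The custom category order is preserved; NaN gets the sentinel code -1.
    assert list(result.cat.categories) == [True, False]
    assert result.cat.codes.tolist() == [0, 1, -1]
    assert result.isna().tolist() == [False, False, True]


test_astype_bool_with_nan_to_categorical()
```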

@mroeschke added the good first issue and Needs Tests labels and removed the Categorical and Dtype Conversions labels Oct 27, 2019
@jreback modified the milestones: Contributions Welcome, 1.0 Nov 2, 2019
@simonjayhawkins
Member

closed in #29344
