Closed
Labels
ExtensionArray: Extending pandas with custom dtypes or arrays.
Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
NA - MaskedArrays: Related to pd.NA and nullable extension arrays
Needs Tests: Unit test(s) needed to prevent regressions
cut: cut, qcut
good first issue
Description
Code Sample
import numpy as np
import pandas as pd
# Nullable integer Series with one missing value at position 5
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]
breaks_cut = pd.cut(series, breaks)
breaks_cut
0 NaN
1 (0.0, 2.0]
2 (0.0, 2.0]
3 (2.0, 4.0]
4 (2.0, 4.0]
5 NaN
6 (0.0, 2.0]
7 (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]
Problem Description
When the Series uses the nullable integer dtype Int64, pd.cut() unexpectedly bins the first non-missing value after a missing value into the lowest interval. In the example above, the number 6 is binned into (0.0, 2.0] instead of (4.0, 6.0].
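One possible workaround, not taken from the report itself, is to cast the nullable column to float64 before binning, so that pd.cut() only ever sees a plain NumPy array with NaN for missing entries. The helper name `cut_nullable` is hypothetical, and this is a sketch under the assumption that the values are small enough to be represented exactly as floats:

```python
import pandas as pd


def cut_nullable(series, bins, **kwargs):
    """Hypothetical helper: bin a nullable-integer Series via a float64 copy."""
    # Missing values become NaN in the float copy; pd.cut leaves NaN unbinned.
    as_float = series.astype("float64")
    return pd.cut(as_float, bins, **kwargs)


series = pd.Series([0, 1, 2, 3, 4, None, 6, 7], dtype="Int64")
result = cut_nullable(series, [0, 2, 4, 6, 8])
print(result)
```

The cast loses precision for integers beyond 2**53, so it is only a stopgap for data in a modest range.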
Expected Output
0 NaN
1 (0.0, 2.0]
2 (0.0, 2.0]
3 (2.0, 4.0]
4 (2.0, 4.0]
5 NaN
6 (4.0, 6.0]
7 (6.0, 8.0]
dtype: category
Categories (4, interval[int64]): [(0, 2] < (2, 4] < (4, 6] < (6, 8]]
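Since the issue is labelled as needing tests, the expectation above could be captured roughly as follows. This is only a sketch: the function name `check_cut_matches_float` is made up for illustration, and it assumes that binning the same values as float64 is the correct reference behaviour.

```python
import pandas as pd

values = [0, 1, 2, 3, 4, None, 6, 7]
breaks = [0, 2, 4, 6, 8]


def check_cut_matches_float(values, breaks):
    """Sketch of a regression check: Int64 and float64 inputs should bin alike."""
    nullable = pd.cut(pd.Series(values, dtype="Int64"), breaks)
    reference = pd.cut(pd.Series(values, dtype="float64"), breaks)
    # Compare integer category codes; -1 marks a missing (unbinned) value.
    return list(nullable.cat.codes) == list(reference.cat.codes)


matches = check_cut_matches_float(values, breaks)
print("Int64 and float64 binning agree:", matches)
```

On a pandas version affected by this bug the check should come out false, because the value 6 lands in a different bin for the Int64 input.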
Note that using an IntervalIndex produces the expected output.
import numpy as np
import pandas as pd
series = pd.Series([0, 1, 2, 3, 4, np.nan, 6, 7], dtype='Int64')
breaks = [0, 2, 4, 6, 8]
# Build the same right-closed bins explicitly as Interval objects
intervals = [pd.Interval(x, y) for x, y in zip(breaks[:-1], breaks[1:])]
interval_index = pd.IntervalIndex(intervals)
interval_cut = pd.cut(series, interval_index)
interval_cut
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.0.0-37-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200102
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None