unexpected behavior with NaN in multiIndex #15107

micheledemeo · 2017-01-11T10:19:05Z

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5], 'ind1': ['a', 'b', 'c', 'd', np.nan], 'ind2': [1, 2, 3, 4, 5]})
df.set_index(['ind1', 'ind2'], inplace=True)

df.loc['a']
#     col
# ind2
# 1     1

df.loc[['a']]
#            col
# ind1 ind2
# a    1       1
#      5       5

df.loc[['a']].reset_index()
#   ind1  ind2  col
# 0    a     1    1
# 1    a     5    5

Problem description

Is it expected that np.nan is replaced with 'a'?

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.28.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.0.0
Cython: None
numpy: 1.11.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None


TomAugspurger · 2017-01-11T13:29:54Z

Thanks for catching that. Nothing special about 'a' by the way. Any label on the outer level will show the same result.

jreback · 2017-01-11T13:33:51Z

hmm, indexing with nans is fraught with complication. the indexer must be off here.

pull requests to fix are welcome.

theodoreschen · 2017-05-24T22:53:52Z

I'm checking out pandas through the PyCon dev sprint, and I'm taking a stab at the issue.

I'm able to replicate the bug, but instead of 'a' replacing the np.nan, I see np.nan (on a Windows box running Python 3.6).

(I think) I've traced the issue down to the Index._join_level() function in the pandas/core/indexes/base.py module. Will report back when I've nailed down the offending line.

theodoreschen · 2017-05-24T23:47:29Z

I think I found the issue...

In pandas/core/indexes/base.py, Index._join_level() contains the following line:
new_lev_labels = algos.take_nd(rev_indexer, left.labels[level], allow_fill=False)

Running PDB these are the values going into algos.take_nd:

(Pdb) level
0
(Pdb) rev_indexer
array([ 0, -1, -1, -1], dtype=int64)
(Pdb) left
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3, 4, 5]], labels=[[0, 1, 2, 3, -1], [0, 1, 2, 3, 4]], names=['ind1', 'ind2'])

algos.take_nd() pads with 0 when allow_fill is False, and pads with nan when allow_fill is True. The subsequent code takes every position where new_lev_labels is 0 and returns the corresponding rows of the original DataFrame object, which is why the row labeled with np.NaN is returned when it shouldn't be.
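The mechanism described above can be sketched with plain NumPy. This is only an emulation of the padding behaviour reported in this comment, not the pandas internals: `take_pad_zero` and `take_pad_minus_one` are hypothetical helpers, and the two arrays are copied from the PDB session.

```python
import numpy as np

# Values copied from the PDB session above.
rev_indexer = np.array([0, -1, -1, -1], dtype=np.int64)
level_labels = np.array([0, 1, 2, 3, -1], dtype=np.int64)  # -1 marks the NaN row


def take_pad_zero(values, indexer):
    """Hypothetical helper: take, writing 0 wherever the indexer is -1."""
    out = values[indexer]      # fancy indexing returns a copy
    out[indexer == -1] = 0     # emulates the reported allow_fill=False padding
    return out


def take_pad_minus_one(values, indexer):
    """Hypothetical helper: take, keeping -1 (missing) wherever the indexer is -1."""
    out = values[indexer]
    out[indexer == -1] = -1
    return out


buggy = take_pad_zero(rev_indexer, level_labels)
fixed = take_pad_minus_one(rev_indexer, level_labels)
print(buggy)  # last entry is 0, i.e. the NaN row is relabeled as level 0 ('a')
print(fixed)  # last entry stays -1, i.e. the NaN row stays missing
```

In the emulated buggy result the NaN row carries label 0, the same label as 'a', which matches the extra row seen in the original report.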

I think the options are either to prevent np.NaN values in indices, or to set allow_fill=True in base.py and replace the resulting nan's with -1.
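From the user side, the first option (keeping np.NaN out of the index) can be approximated by dropping rows with missing key values before calling set_index. A minimal sketch using the frame from the original report; `key_cols` and `clean` are names chosen here for illustration, and whether dropping those rows is acceptable depends on the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5],
                   'ind1': ['a', 'b', 'c', 'd', np.nan],
                   'ind2': [1, 2, 3, 4, 5]})

key_cols = ['ind1', 'ind2']

# Drop rows whose index keys are missing, then build the MultiIndex.
clean = df.dropna(subset=key_cols).set_index(key_cols)

result = clean.loc[['a']]
print(result)
```

With no missing values in the outer level, the list-based lookup has no NaN row to pick up by mistake.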

toobaz · 2019-06-28T19:16:37Z

The problem itself is fixed:

In [3]: df = pd.DataFrame({'col':[1, 2, 3, 4, 5], 'ind1':['a','b','c','d',np.nan], 'ind2':[1,2,3,4,5] })
   ...: df.set_index(['ind1','ind2'],inplace=True)

In [4]: df.loc[['a']]
Out[4]:
           col
ind1 ind2
a    1       1
NaN  5       5

In [5]: df.loc[['a']].reset_index()
Out[5]:
  ind1  ind2  col
0    a     1    1
1  NaN     5    5

and tested (pandas/pandas/tests/frame/test_alter_axes.py, line 1120 in 3a53954):

rs = df.set_index(['A', 'B']).reset_index()

... but Out[4] is still wrong: it should not include the second row. I opened #27104 for clarity.
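Until list-based partial indexing excludes the NaN row, one way to select only the requested outer-level labels is an explicit boolean mask on the level values. This is a sketch of a workaround, not the fix discussed in this thread; `wanted`, `mask` and `selected` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5],
                   'ind1': ['a', 'b', 'c', 'd', np.nan],
                   'ind2': [1, 2, 3, 4, 5]}).set_index(['ind1', 'ind2'])

wanted = ['a']

# Compare the outer level's values directly; NaN never matches a label in `wanted`.
mask = df.index.get_level_values('ind1').isin(wanted)
selected = df[mask]
print(selected)
```

Because the mask is built from the level values rather than from the internal integer labels, the row with a missing 'ind1' is excluded.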

TomAugspurger added the Indexing, Missing-data, and MultiIndex labels Jan 11, 2017

TomAugspurger added this to the 0.20.0 milestone Jan 11, 2017

jreback added the Difficulty Intermediate label Jan 11, 2017

jreback changed the milestone from 0.20.0 to Next Major Release Mar 23, 2017

toobaz mentioned this issue Jun 28, 2019

Partial indexing with list on MultiIndex with missing value includes them despite not being in the list #27104

Closed

toobaz closed this as completed Jun 28, 2019
