unexpected behavior with NaN in multiIndex #15107

Closed
micheledemeo opened this issue Jan 11, 2017 · 5 comments

Labels
Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex

Comments

@micheledemeo

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5], 'ind1': ['a', 'b', 'c', 'd', np.nan], 'ind2': [1, 2, 3, 4, 5]})
df.set_index(['ind1', 'ind2'], inplace=True)

df.loc['a']
#     col
# ind2
# 1     1

df.loc[['a']]
#            col
# ind1 ind2
# a    1       1
#      5       5

df.loc[['a']].reset_index()
#   ind1  ind2  col
# 0    a     1    1
# 1    a     5    5

Problem description

Is it expected that np.nan gets replaced with 'a'?

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.28.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 32.0.0
Cython: None
numpy: 1.11.3
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex labels Jan 11, 2017
@TomAugspurger
Contributor

Thanks for catching that. Nothing special about 'a' by the way. Any label on the outer level will show the same result.
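
A minimal sketch for checking this on a given install: it selects each outer-level label with a list and reports whether the row whose `ind1` is NaN comes back as well. The helper name `nan_row_leaks` is hypothetical, and the outcome depends on the pandas version, so no expected output is shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5],
                   'ind1': ['a', 'b', 'c', 'd', np.nan],
                   'ind2': [1, 2, 3, 4, 5]})
df = df.set_index(['ind1', 'ind2'])


def nan_row_leaks(frame, label):
    """Hypothetical helper: True if frame.loc[[label]] also returns
    the row with col == 5, i.e. the row whose ind1 is NaN."""
    selected = frame.loc[[label]]
    return bool((selected['col'] == 5).any())


# Try every non-missing label on the outer level.
for label in ['a', 'b', 'c', 'd']:
    print(label, nan_row_leaks(df, label))
```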

@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Jan 11, 2017
@jreback
Contributor

jreback commented Jan 11, 2017

hmm, indexing with nans is fraught with complication. the indexer must be off here.

pull requests to fix are welcome.

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@theodoreschen

theodoreschen commented May 24, 2017

I'm checking out pandas through the pycon dev sprint, and I'm taking a stab at the issue.

I'm able to replicate the bug, but instead of getting 'a' in place of the np.nan, I see np.nan (on a Windows box running Python 3.6).

I think I've traced the issue down to the Index._join_level() function in the pandas/core/indexes/base.py module. Will report back when I've nailed down the offending line.

@theodoreschen

theodoreschen commented May 24, 2017

I think I found the issue...

In pandas/core/indexes/base.py, Index._join_level() contains the following line:
new_lev_labels = algos.take_nd(rev_indexer, left.labels[level], allow_fill=False)

Running PDB these are the values going into algos.take_nd:

(Pdb) level
0
(Pdb) rev_indexer
array([ 0, -1, -1, -1], dtype=int64)
(Pdb) left
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3, 4, 5]], labels=[[0, 1, 2, 3, -1], [0, 1, 2, 3, 4]], names=['ind1', 'ind2'])

algos.take_nd() pads with 0 when allow_fill is False and with NaN when allow_fill is True. The subsequent code treats every position of new_lev_labels that holds 0 as a match and returns the corresponding rows of the original DataFrame, which is why the row labeled np.nan is returned when it shouldn't be.

I think the options are either to prevent np.nan values in indices, or to set allow_fill=True in base.py and replace the NaNs with -1.
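
To make the effect concrete, here is a small NumPy-only sketch that imitates the padding behaviour described above, using the values from the PDB session. `take_codes` is a hypothetical stand-in written for illustration; it is not the pandas implementation and only assumes that a -1 entry in the indexer is padded with 0 or NaN as described.

```python
import numpy as np


def take_codes(arr, indexer, allow_fill):
    """Hypothetical stand-in for the described behaviour (not pandas code):
    positions where indexer == -1 are padded instead of looked up."""
    arr = np.asarray(arr)
    indexer = np.asarray(indexer)
    missing = indexer == -1
    # Look up with a safe index first, then overwrite the padded positions.
    out = arr[np.where(missing, 0, indexer)]
    if allow_fill:
        out = out.astype(float)
        out[missing] = np.nan   # padded with NaN
    else:
        out = out.copy()
        out[missing] = 0        # padded with the integer zero
    return out


rev_indexer = np.array([0, -1, -1, -1])    # 'a' -> 0, other labels -> -1
level_codes = np.array([0, 1, 2, 3, -1])   # left.labels[0]; -1 marks the NaN row

padded = take_codes(rev_indexer, level_codes, allow_fill=False)
print(padded)                          # [ 0 -1 -1 -1  0]
print(np.flatnonzero(padded != -1))    # [0 4]: the NaN row is kept by mistake

# Second option from above: fill with NaN, then map NaN back to -1.
filled = take_codes(rev_indexer, level_codes, allow_fill=True)
fixed = np.where(np.isnan(filled), -1, filled).astype(np.int64)
print(np.flatnonzero(fixed != -1))     # [0]: only the 'a' row remains
```

With the zero padding, the NaN row ends up with the same code as the first label, which matches the symptom in the report.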

@toobaz
Member

toobaz commented Jun 28, 2019

The problem itself is fixed:

In [3]: df = pd.DataFrame({'col':[1, 2, 3, 4, 5], 'ind1':['a','b','c','d',np.nan], 'ind2':[1,2,3,4,5] })
   ...: df.set_index(['ind1','ind2'],inplace=True)

In [4]: df.loc[['a']]
Out[4]:
           col
ind1 ind2
a    1       1
NaN  5       5

In [5]: df.loc[['a']].reset_index()
Out[5]:
  ind1  ind2  col
0    a     1    1
1  NaN     5    5

and tested:

rs = df.set_index(['A', 'B']).reset_index()

... but Out[4] is still not right: it should not include the second row. I opened #27104 for clarity.
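
For reference, the expected selection (the 'a' row only) can be spelled out without the list-based `.loc` path by building a boolean mask on the outer level. This is a sketch of a possible workaround and is not something proposed in this thread; it relies on `Index.isin` returning False for the NaN entry.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5],
                   'ind1': ['a', 'b', 'c', 'd', np.nan],
                   'ind2': [1, 2, 3, 4, 5]})
df = df.set_index(['ind1', 'ind2'])

# Boolean mask over the outer level; the NaN entry never matches 'a'.
mask = df.index.get_level_values('ind1').isin(['a'])
expected = df[mask]
print(expected)
```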

@toobaz toobaz closed this as completed Jun 28, 2019