DataFrame.reset_index deletes index, does not allow for ints as level arg #16263

Closed
m4g005 opened this issue May 6, 2017 · 7 comments · Fixed by #16266
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version

m4g005 commented May 6, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
pd.__version__

u'0.20.1'

data = pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D'])
data
          A         B         C         D
0 -0.134549  2.352525  0.481132 -1.919506
1  1.980074  0.720437  0.410702 -0.703470
2 -3.063166 -0.781255  0.270469 -0.539081
3 -1.125265  0.308374 -0.166085 -1.253959
data.set_index(['A'], inplace=True)
data
                  B         C         D
A
-0.134549  2.352525  0.481132 -1.919506
 1.980074  0.720437  0.410702 -0.703470
-3.063166 -0.781255  0.270469 -0.539081
-1.125265  0.308374 -0.166085 -1.253959
data.reset_index(level=['A'], inplace=True)
data
          B         C         D
0  2.352525  0.481132 -1.919506
1  0.720437  0.410702 -0.703470
2 -0.781255  0.270469 -0.539081
3  0.308374 -0.166085 -1.253959

Problem description

Between v0.19.2 and v0.20.1, the behavior of DataFrame.reset_index changed.
With a single column set as the index:

  • It no longer keeps the index as a column (in effect, drop=True is always on)
  • Passing an int as level no longer works (iterables are accepted)
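
For reference, the behavior being asked for can be sketched as below. This is a minimal check written for this report, not code from pandas itself; the fixed seed and the variable names are assumptions added for illustration. On a pandas version without the regression, each call is expected to move "A" back into the columns.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is reproducible; the values themselves are arbitrary.
rng = np.random.RandomState(0)
data = pd.DataFrame(rng.randn(4, 4), columns=["A", "B", "C", "D"])
indexed = data.set_index("A")

# Three ways of naming the single index level: by name, by list of names,
# and by integer position.
by_name = indexed.reset_index(level="A")
by_list = indexed.reset_index(level=["A"])
by_position = indexed.reset_index(level=0)

for result in (by_name, by_list, by_position):
    # Each result should list "A" first, followed by the remaining columns.
    print(list(result.columns))
```

If any of the three prints only B, C and D, or the integer form raises, the installed version shows the regression described here.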

Expected Output

v0.19.2 results:

data.reset_index(level=['A'], inplace=True)
data
A B C D
0 0.100442 -0.620740 -2.018020 1.059871
1 -0.530272 0.402598 -1.453445 -0.729623
2 -1.040126 -0.536687 -1.136123 -0.748891
3 -0.269727 0.182250 0.847344 0.785692

Output of pd.show_versions()

import pandas as pd
import numpy as np
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 35.0.2
Cython: None
numpy: 1.12.1
scipy: None
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None

toobaz (Member) commented May 6, 2017

This is related to passing the level= argument when the index is not a MultiIndex. Previously, the argument was simply discarded:

In [3]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).reset_index(level='not present')
Out[3]: 
   index         A         B         C         D
0      0  0.057457  0.065932  0.276079  0.305390
1      1 -0.562195 -0.385750 -0.228925 -0.426511
2      2  0.377559 -0.837031 -0.384840 -0.305262
3      3 -0.670057 -0.737446  0.561989  0.528754

In [4]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).reset_index(level=['not present'])
Out[4]: 
   index         A         B         C         D
0      0  0.613373 -0.169316 -0.592379  1.050764
1      1  0.069762  0.995308  0.030434 -0.361300
2      2 -0.526487  0.165054  0.015452  0.954447
3      3  0.585677 -1.435712 -0.298280 -0.581473

but a7a0574 changed the behaviour so that now, conversely, even valid level names/indices are ignored.

I can provide a PR; the question is what we want to do with non-existent level names: raise or ignore?

toobaz (Member) commented May 6, 2017

> I can provide a PR, the question is what we want to do with non-existent level names: raise or ignore?

Sorry, the question is already answered by the behaviour when there is a MultiIndex:

In [5]: pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).set_index(['A', 'B']).reset_index(level=['A', 'E'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    610                                  'level number' % level)
--> 611             level = self.names.index(level)
    612         except ValueError:

ValueError: 'E' is not in list

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-f1686d7d4dfc> in <module>()
----> 1 pd.DataFrame(np.random.randn(4,4), columns=['A', 'B', 'C', 'D']).set_index(['A', 'B']).reset_index(level=['A', 'E'])

/home/nobackup/repo/pandas/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   3016             if not isinstance(level, (tuple, list)):
   3017                 level = [level]
-> 3018             level = [self.index._get_level_number(lev) for lev in level]
   3019         if isinstance(self.index, MultiIndex):
   3020             if len(level) < self.index.nlevels:

/home/nobackup/repo/pandas/pandas/core/frame.py in <listcomp>(.0)
   3016             if not isinstance(level, (tuple, list)):
   3017                 level = [level]
-> 3018             level = [self.index._get_level_number(lev) for lev in level]
   3019         if isinstance(self.index, MultiIndex):
   3020             if len(level) < self.index.nlevels:

/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
    612         except ValueError:
    613             if not isinstance(level, int):
--> 614                 raise KeyError('Level %s not found' % str(level))
    615             elif level < 0:
    616                 level += self.nlevels

KeyError: 'Level E not found'
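
The "raise or ignore" question can be probed with a small script like the one below. It is a sketch under assumptions: `try_reset` is a hypothetical helper written for this thread, not a pandas function, and only the MultiIndex result is confirmed by the traceback above. The single-index case is what one would expect if the same rule is applied consistently.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16.0).reshape(4, 4), columns=["A", "B", "C", "D"])


def try_reset(df, level):
    """Hypothetical helper: return the name of the exception raised by
    reset_index(level=...), or None if the call succeeds."""
    try:
        df.reset_index(level=level)
    except (KeyError, IndexError, ValueError) as exc:
        return type(exc).__name__
    return None


multi = frame.set_index(["A", "B"])
single = frame.set_index("A")

# 'E' is not a level of the MultiIndex: the traceback above shows a KeyError.
multi_unknown = try_reset(multi, ["A", "E"])

# Same request against a single-level index; raising here would be the
# consistent choice, but treat that as an expectation, not a guarantee.
single_unknown = try_reset(single, "E")

# A valid level name on a single-level index should not raise at all.
single_valid = try_reset(single, "A")

print(multi_unknown, single_unknown, single_valid)
```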

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label May 6, 2017
@jorisvandenbossche jorisvandenbossche added this to the 0.20.2 milestone May 6, 2017

jorisvandenbossche (Member) commented

@m4g005 Thanks for the report! And thanks @toobaz for the quick analysis.
This is indeed a regression (although it seems it was working more by accident before).

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label May 6, 2017

schwab commented May 6, 2017

Perhaps it was working by accident before, but the new behavior of completely dropping the index column when reset_index is called seems problematic. Additionally, the docs for reset_index say "For a standard index, the index name will be used...", which means the behavior is now also out of sync with the documented spec. It also raises an important question: if we do want to keep this behavior going forward, what is the new "correct" way to remove a single-column index from a dataframe while keeping its data?
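
To make the documented naming rule concrete, here is a small illustration. The frame and variable names are made up for this example; the behavior shown (an unnamed index becomes a column called "index", a named index keeps its name) matches the output earlier in this thread and the quoted documentation.

```python
import pandas as pd

# Default RangeIndex has no name.
unnamed = pd.DataFrame({"B": [1, 2], "C": [3, 4]})

# Index built from column "B" carries the name "B".
named = unnamed.set_index("B")

print(list(unnamed.reset_index().columns))  # ['index', 'B', 'C']
print(list(named.reset_index().columns))    # ['B', 'C']
```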

schwab commented May 6, 2017

@m4g005 To get this working the same in both versions, you can try it without the level name.

data.reset_index(inplace=True)
data
          A         B         C         D
0   1.11556   1.21351 -0.185124  0.868765
1   1.63402  0.322284  0.299842 -0.174827
2  -1.21852  -0.35271  0.773597   1.62995
3 -0.416348 -0.113201 -0.151533  -1.01033
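
A compact way to see the difference between this workaround and an explicit drop, using a small made-up frame (the values and names below are assumptions for illustration only):

```python
import pandas as pd

data = pd.DataFrame({"A": [1.5, 2.5], "B": [10, 20]}).set_index("A")

# No level argument: the index comes back as column "A".
kept = data.reset_index()

# Explicit drop=True: the index values are discarded.
dropped = data.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```

The regression reported here made a call with level= behave like the second form even though drop was left at its default.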

toobaz (Member) commented May 6, 2017

> then what is the new "correct" way to remove a single column index from a dataframe while keeping its data?

This is going to be fixed, no doubt.

jorisvandenbossche (Member) commented

Indeed, @schwab, as I confirmed above, this is a regression: it is supposed to work, and @toobaz has already made a PR to fix it.


5 participants