set_levels for MultiIndex columns with blank strings #16214

Open
dinya opened this issue May 3, 2017 · 7 comments

Comments

@dinya

dinya commented May 3, 2017

I want to rename columns in a MultiIndex with

import pandas as pd

def rename_columns(df, template):
    for i, columns in enumerate(df.columns.levels):
        columns = columns.tolist()
        for j, row in enumerate(columns):
            if template in row:
                columns[j] = ""
        df.columns.set_levels(columns, level=i, inplace=True)

and then call reset_index().

This code

df = pd.DataFrame([[1,2,3], [10,20,30]])
df.columns = pd.MultiIndex.from_tuples([("a", "a1"), ("b", "b1"), ("c", "c1")])
rename_columns(df, "a1")
df.reset_index()

works well

But if I use [("a", "a1"), ("b", ""), ("c", "c1")] instead of [("a", "a1"), ("b", "b1"), ("c", "c1")] for the column names

df = pd.DataFrame([[1,2,3], [10,20,30]])
df.columns = pd.MultiIndex.from_tuples([("a", "a1"), ("b", ""), ("c", "c1")])
rename_columns(df, "a1")
df.reset_index()

the code raises

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\formats\format.pyc in to_string(self)
    565             text = info_line
    566         else:
--> 567             strcols = self._to_str_columns()
    568             if self.line_width is None:  # no need to wrap around just print
    569                 # the whole frame

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\formats\format.pyc in _to_str_columns(self)
    488         if self.header:
    489             stringified = []
--> 490             for i, c in enumerate(frame):
    491                 cheader = str_columns[i]
    492                 max_colwidth = max(self.col_space or 0, *(self.adj.len(x)

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\core\generic.pyc in __iter__(self)
    833     def __iter__(self):
    834         """Iterate over infor axis"""
--> 835         return iter(self._info_axis)
    836 
    837     # can we get a better explanation of this?

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\indexes\base.pyc in __iter__(self)
   1344 
   1345     def __iter__(self):
-> 1346         return iter(self.values)
   1347 
   1348     def __reduce__(self):

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\indexes\multi.pyc in values(self)
    628             values.append(taken)
    629 
--> 630         self._tuples = lib.fast_zip(values)
    631         return self._tuples
    632 

pandas\lib.pyx in pandas.lib.fast_zip (pandas\lib.c:11630)()

ValueError: all arrays must be same length

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.12.1
scipy: 0.19.0
statsmodels: None
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
boto: None
pandas_datareader: None
@TomAugspurger
Contributor

A slightly simpler example:

In [44]: columns = pd.MultiIndex([['a', 'b', 'c'], ['', '', 'c1']], [[0, 1, 2], [1, 0, 2]])

In [45]: pd.DataFrame(columns=columns).reset_index()

Presumably because your df.columns has duplicates in the second level. I'll take a closer look later.
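To see where the duplicates come from, here is a minimal sketch of the label list that rename_columns ends up passing to set_levels for the second level. The helper name blanked_level is hypothetical and only mirrors the replacement loop from the report; it does not call set_levels itself.

```python
import pandas as pd


def blanked_level(tuples, level, template):
    """Hypothetical helper: the label list rename_columns would pass to set_levels."""
    levels = pd.MultiIndex.from_tuples(tuples).levels[level]
    # Same replacement rule as in the report: blank every label containing `template`.
    return ["" if template in label else label for label in levels]


# First report: level 1 has no blank label before "a1" is blanked.
ok = blanked_level([("a", "a1"), ("b", "b1"), ("c", "c1")], level=1, template="a1")

# Second report: level 1 already contains "" before "a1" is blanked.
bad = blanked_level([("a", "a1"), ("b", ""), ("c", "c1")], level=1, template="a1")

print(ok, pd.Index(ok).is_unique)
print(bad, pd.Index(bad).is_unique)
```

In the second case the blanked "a1" collides with the "" that was already in the level, which gives the same level values as the simplified example above.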

@dinya
Author

dinya commented May 3, 2017

@TomAugspurger
Yes, it is. Is it a pandas restriction? Is the only way to temporarily rename the duplicates before reset_index() and then rename them back to "" afterwards?

@dinya
Author

dinya commented May 3, 2017

BTW,

import pandas as pd

def rename_columns(df, template):
    for i, columns in enumerate(df.columns.levels):
        columns = columns.tolist()
        for j, row in enumerate(columns):
            if template in row:
                columns[j] = ""
        df.columns.set_levels(columns, level=i, inplace=True)
        
        
df = pd.DataFrame([[1,2,3], [10,20,30]])
df.columns = pd.MultiIndex.from_tuples([("a", "a1"), ("b", "b1"), ("c", "c1")])
rename_columns(df, "a1")

df["d"] = 1
df

works well, but

df = pd.DataFrame([[1,2,3], [10,20,30]])
df.columns = pd.MultiIndex.from_tuples([("a", "a1"), ("b", ""), ("c", "c1")])
rename_columns(df, "a1")
df["d"] = 1
df

raises

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-33-515550a9c85a> in <module>()
      2 df.columns = pd.MultiIndex.from_tuples([("a", "a1"), ("b", ""), ("c", "c1")])
      3 rename_columns(df, "a1")
----> 4 df["d"] = 1
      5 df

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\core\frame.pyc in __setitem__(self, key, value)
   2417         else:
   2418             # set column
-> 2419             self._set_item(key, value)
   2420 
   2421     def _setitem_slice(self, key, value):

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\core\frame.pyc in _set_item(self, key, value)
   2484         self._ensure_valid_index(value)
   2485         value = self._sanitize_column(key, value)
-> 2486         NDFrame._set_item(self, key, value)
   2487 
   2488         # check if we are modifying a copy

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\core\generic.pyc in _set_item(self, key, value)
   1498 
   1499     def _set_item(self, key, value):
-> 1500         self._data.set(key, value)
   1501         self._clear_item_cache()
   1502 

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\core\internals.pyc in set(self, item, value, check)
   3669         except KeyError:
   3670             # This item wasn't present, just insert at end
-> 3671             self.insert(len(self.items), item, value)
   3672             return
   3673 

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\core\internals.pyc in insert(self, loc, item, value, allow_duplicates)
   3767 
   3768         # insert to the axis; this could possibly raise a TypeError
-> 3769         new_axis = self.items.insert(loc, item)
   3770 
   3771         block = make_block(values=value, ndim=self.ndim,

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\pandas\indexes\multi.pyc in insert(self, loc, item)
   2241 
   2242             new_levels.append(level)
-> 2243             new_labels.append(np.insert(_ensure_int64(labels), loc, lev_loc))
   2244 
   2245         return MultiIndex(levels=new_levels, labels=new_labels,

C:\Users\USER\AppData\Local\Continuum\Miniconda2\lib\site-packages\numpy\lib\function_base.pyc in insert(arr, obj, values, axis)
   4896         # There are some object array corner cases here, but we cannot avoid
   4897         # that:
-> 4898         values = array(values, copy=False, ndmin=arr.ndim, dtype=arr.dtype)
   4899         if indices.ndim == 0:
   4900             # broadcasting is very different here, since a[:,0,:] = ... behaves

TypeError: long() argument must be a string or a number, not 'slice'

Is the reason the same?
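A small sketch that may explain the 'slice' in the message. It assumes that MultiIndex.insert looks up each part of the new key with the level's get_loc; the traceback only shows the resulting lev_loc being handed to np.insert, so that lookup is an assumption here. The two indexes below are illustrative stand-ins for the second level before and after the duplicate "" appears.

```python
import pandas as pd

# Stand-ins for the second column level (assumed values, for illustration only).
unique_level = pd.Index(["", "c1"])
duplicated_level = pd.Index(["", "", "c1"])

# On a unique level, get_loc returns a single integer position.
unique_loc = unique_level.get_loc("")

# On a level with a duplicated label, get_loc no longer returns an integer.
duplicated_loc = duplicated_level.get_loc("")

print(type(unique_loc).__name__, type(duplicated_loc).__name__)
```

If that assumption holds, the non-integer location is what np.insert fails to convert, so this error would have the same root cause as the reset_index() one: a non-unique level.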

@TomAugspurger
Contributor

Yes, it is. Is it a pandas restriction?

I don't recall if / where it's enforced. In the docstring for MI, we do say

levels : sequence of arrays
    The unique labels for each level

In general, directly manipulating the labels / levels can get you into sticky situations.

And as you say, it's not just that there are duplicates. It seems like the empty string matters.
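One possible way to avoid editing the levels directly, sketched under the assumption that rebuilding the whole column index is acceptable: rewrite the column tuples and let MultiIndex.from_tuples recompute the levels. The function name blank_matching_labels is hypothetical and is not part of pandas.

```python
import pandas as pd


def blank_matching_labels(df, template):
    """Hypothetical helper: blank every column label that contains `template`."""
    new_tuples = [
        tuple(
            "" if isinstance(label, str) and template in label else label
            for label in col
        )
        for col in df.columns
    ]
    out = df.copy()
    # from_tuples derives the levels from the tuples, so each level stays unique.
    out.columns = pd.MultiIndex.from_tuples(new_tuples, names=df.columns.names)
    return out


df = pd.DataFrame([[1, 2, 3], [10, 20, 30]])
df.columns = pd.MultiIndex.from_tuples([("a", "a1"), ("b", ""), ("c", "c1")])

result = blank_matching_labels(df, "a1")
flat = result.reset_index()
print(result.columns.tolist())
```

Unlike the original rename_columns, this returns a new frame instead of modifying df in place.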

@chris-b1
Contributor

chris-b1 commented May 3, 2017

xref #11424 - empty string MI labels sometimes get special treatment, guessing it could be related.

@jreback
Contributor

jreback commented May 6, 2017

cc @toobaz

@ghost

ghost commented May 17, 2020

Hi everyone! This issue is very old and still open. Is the problem still relevant? Should we label it as asking for help?

@jreback jreback added this to the Contributions Welcome milestone May 17, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022