Unable to read MultiIndex columns from CSV if empty levels #13054

jluttine · 2016-05-02T06:33:38Z

If I use MultiIndex columns and if a level happens to have empty values for all columns, the saved CSV file cannot be read. I expected to recover the dataframe from the saved CSV perfectly.

I believe #6618 might be related, because this is somehow related to how Pandas uses an empty data row to separate column names and actual data when using MultiIndex columns.

Code Sample, a copy-pastable example if possible

This works as expected:

In [1]: pd.DataFrame({('a','b'): [1, 2], ('c','d'): [3, 4]}).to_csv('temp.csv', index=False)

In [2]: pd.read_csv('temp.csv', header=[0,1])
Out[2]: 
   a  c
   b  d
0  1  3
1  2  4

However, if a level is empty (i.e., all columns are '' on that level), it doesn't work:

In [3]: pd.DataFrame({('a',''): [1, 2], ('c',''): [3, 4]}).to_csv('temp.csv', index=False)

In [4]: pd.read_csv('temp.csv', header=[0,1])
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-73-9f097e07e5a9> in <module>()
----> 1 pd.read_csv('temp.csv', header=[0,1])

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    527                     skip_blank_lines=skip_blank_lines)
    528 
--> 529         return _read(filepath_or_buffer, kwds)
    530 
    531     parser_f.__name__ = name

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    293 
    294     # Create the parser.
--> 295     parser = TextFileReader(filepath_or_buffer, **kwds)
    296 
    297     if (nrows is not None) and (chunksize is not None):

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    610             self.options['has_index_names'] = kwds['has_index_names']
    611 
--> 612         self._make_engine(self.engine)
    613 
    614     def _get_options_with_defaults(self, engine):

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    745     def _make_engine(self, engine='c'):
    746         if engine == 'c':
--> 747             self._engine = CParserWrapper(self.f, **self.options)
    748         else:
    749             if engine == 'python':

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1132                     self._extract_multi_indexer_columns(
   1133                         self._reader.header, self.index_names, self.col_names,
-> 1134                         passed_names
   1135                     )
   1136                 )

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
    906                     "Passed header=[%s] are too many rows for this "
    907                     "multi_index of columns"
--> 908                     % ','.join([str(x) for x in self.header])
    909                 )
    910 

CParserError: Passed header=[0,1] are too many rows for this multi_index of columns

Expected Output

Expected that the empty columns are read correctly because I had explicitly specified the rows to use as column index:

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.5.2-gnu-1
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_DK.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None

The text was updated successfully, but these errors were encountered:

jreback · 2016-05-02T13:03:04Z

I suppose this could be fixed, very odd case though. Using empty levels is really confusing / prob should be banned.

jorisvandenbossche · 2016-08-27T13:24:24Z

From a discussion with @mralgos this morning, there seems to be a specific check in the implementation about this:

pandas/pandas/io/parsers.py

Lines 1124 to 1132 in 9d10b76

    
           # if we find 'Unnamed' all of a single level, then our header was too 
        
           # long 
        
           for n in range(len(columns[0])): 
        
               if all(['Unnamed' in tostr(c[n]) for c in columns]): 
        
                   raise CParserError( 
        
                       "Passed header=[%s] are too many rows for this " 
        
                       "multi_index of columns" 
        
                       % ','.join([str(x) for x in self.header]) 
        
                   )

So it seems this is done on purpose. The question is also what would be the expected output of this should not error.
Removing the above mentioned code lines, gives you this output

s = """a,b
,
1,2
3,4"""

In [4]: pd.read_csv(StringIO(s), header=[0,1])
Out[4]: 
                   a                  b
  Unnamed: 0_level_1 Unnamed: 1_level_1
0                  1                  2
1                  3                  4

While @jluttine expects a second index level with NaNs?

jluttine · 2016-08-29T06:20:20Z

Yep, I would have expected a second index level with NaNs.

jorisvandenbossche · 2016-08-31T11:02:37Z

cc @gfyoung @chris-b1 Any opinion on whether we should raise an error, fill empty level with 'Unnamed ...' (see #13054 (comment)), or fill empty level with NaNs?

So case is:

s = """a,b
,
1,2
3,4"""

pd.read_csv(StringIO(s), header=[0,1])

If we don't raise an error, filling with 'Unnamed ..' seems more in line with other cases where a column label is missing.

gfyoung · 2016-09-01T03:16:45Z

I would agree that Unnamed, while a little aesthetically unpleasing, is consistent with what we would do as you mentioned @jorisvandenbossche should we choose not to raise an error. I agree with @jreback that this is an unusual use case, but then again, I see no reason not to ban it.

den-run-ai · 2017-01-16T05:46:45Z

I don't like this "Unnamed..." thing in the multilevel columns, when reading from CSV. Some levels of the column are filled only partially in my case. The empty levels of some columns should probably be imported as NaNs. The reason is that it looks less ugly and can be easily replaced with empty string or whatever the user wishes.

gfyoung · 2017-01-16T06:44:00Z

@denfromufa : I agree that it is not aesthetically as pleasing, but IMO we should also keep in mind that clarity is of great importance. Unnamed: ... sticks out very clearly in the result to indicate that something was missing in the header provided. In addition, the naming system protects against duplicate levels, which can make indexing more difficult.

So while I do agree about the aesthetics, I would not be in favor of using NaN as placeholders. While I am not against banning the example provided above, I am inclined to stick the Unnamed paradigm (which we use in single-level columns) for the time being.

Perhaps it is best that we first settle (or regather consensus on) the initial matter: is the example in the issue allowed, or should we ban it? If we decide to ban it, your point (and all other points about the naming) are somewhat moot. However, we largely agree that the example should be allowed, then we can perhaps open the discussion up about the naming scheme.

Personally, I am okay with allowing this example through. It's unusual, but I see no reason to ban it. Thoughts, anyone?

den-run-ai · 2017-01-16T07:05:35Z

I ended up not defining the header for columns while reading CSV, and later combining these rows representing the levels into one string or tuple with "" or None used for NaNs respectively. This is not general-purpose solution.

…

On Mon, Jan 16, 2017, 12:44 AM gfyoung ***@***.***> wrote: @denfromufa <https://github.com/denfromufa> : I agree that it is not aesthetically as pleasing, but IMO we should also keep in mind that clarity is of great importance. Unnamed: ... sticks out very clearly in the result to indicate that something was missing in the header provided. In addition, the naming system protects against duplicate levels, which can make indexing more difficult. So while I do agree about the aesthetics, I would not be in favor of using NaN as placeholders. While I am not against banning the example provided above, I am inclined to stick the Unnamed paradigm (which we use in single-level columns) for the time being. Perhaps it is best that we first settle (or regather consensus on) the initial matter: is the example in the issue allowed, or should we ban it? If we decide to ban it, your point (and all other points about the naming) are somewhat moot. However, we largely agree that the example should be allowed, then we can perhaps open the discussion up about the naming scheme. Personally, I am okay with allowing this example through. It's unusual, but I see no reason to ban it. Thoughts, anyone? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#13054 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHgZ5fmJA4dY7kAzrryzO-ExMTzH3Wdjks5rSxG5gaJpZM4IVJZg> .

gfyoung · 2017-01-16T07:11:26Z

@denfromufa : I hate to break it to you, but there is no such thing as a general-purpose solution. 😄 That's why I'm asking for consensus before we do anything.

GabrielValeRios · 2019-08-27T12:20:16Z

@denfromufa : I agree that it is not aesthetically as pleasing, but IMO we should also keep in mind that clarity is of great importance. Unnamed: ... sticks out very clearly in the result to indicate that something was missing in the header provided. In addition, the naming system protects against duplicate levels, which can make indexing more difficult.

So while I do agree about the aesthetics, I would not be in favor of using NaN as placeholders. While I am not against banning the example provided above, I am inclined to stick the Unnamed paradigm (which we use in single-level columns) for the time being.

Perhaps it is best that we first settle (or regather consensus on) the initial matter: is the example in the issue allowed, or should we ban it? If we decide to ban it, your point (and all other points about the naming) are somewhat moot. However, we largely agree that the example should be allowed, then we can perhaps open the discussion up about the naming scheme.

Personally, I am okay with allowing this example through. It's unusual, but I see no reason to ban it. Thoughts, anyone?

I'm new at contributing to pandas, and I would like to know if this issue is still relevant to solve. Also, I agree with @gfyoung that Unnamed could be a better placeholder than NaN.

fpunny · 2020-02-28T22:00:02Z

Is it possible for me to pick this ticket up? I haven't really worked with Pandas before and would like to do it to get familar

sfsinger19103 · 2021-02-12T20:25:07Z

Has this been implemented yet? If so, in which pandas version? @fpunny

smith120bh · 2021-10-07T11:04:13Z

I'll just throw an extra opinion into here, where I'm very much in favour of the "Unnamed" approach here rather than error'ing out. My use case is that we have a standard format for some Excel files we want to let staff import, which includes three rows of headers (label, symbol, and units). However, there are cases where, for example, none of the columns should have a unit attached to them. We've told our staff that they should just always put a random space in the heading rows of their spreadsheet if they get a weird error message. While functional, adding a random space is a really silly solution.

jreback added Difficulty Novice IO CSV read_csv, to_csv MultiIndex labels May 2, 2016

jreback added this to the Next Major Release milestone May 2, 2016

TomAugspurger added the good first issue label Oct 11, 2017

jreback removed the Difficulty Novice label Dec 15, 2017

jbrockmendel removed the Effort Low label Oct 21, 2019

github-actions bot assigned fpunny Mar 7, 2020

fpunny mentioned this issue Mar 7, 2020

Find open issues CSCD01/team_24-project#1

Closed

3 tasks

mroeschke added Bug and removed good first issue labels Apr 24, 2021

phofl mentioned this issue Nov 28, 2021

Stop raising in read_csv when header row contains only empty cells #44657

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Nov 29, 2021

jreback closed this as completed in #44657 Dec 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to read MultiIndex columns from CSV if empty levels #13054

Unable to read MultiIndex columns from CSV if empty levels #13054

jluttine commented May 2, 2016

jreback commented May 2, 2016

jorisvandenbossche commented Aug 27, 2016 •

edited

Loading

jluttine commented Aug 29, 2016

jorisvandenbossche commented Aug 31, 2016

gfyoung commented Sep 1, 2016

den-run-ai commented Jan 16, 2017

gfyoung commented Jan 16, 2017

den-run-ai commented Jan 16, 2017 via email

gfyoung commented Jan 16, 2017

GabrielValeRios commented Aug 27, 2019

fpunny commented Feb 28, 2020

sfsinger19103 commented Feb 12, 2021

smith120bh commented Oct 7, 2021

Unable to read MultiIndex columns from CSV if empty levels #13054

Unable to read MultiIndex columns from CSV if empty levels #13054

Comments

jluttine commented May 2, 2016

Code Sample, a copy-pastable example if possible

Expected Output

output of pd.show_versions()

jreback commented May 2, 2016

jorisvandenbossche commented Aug 27, 2016 • edited Loading

jluttine commented Aug 29, 2016

jorisvandenbossche commented Aug 31, 2016

gfyoung commented Sep 1, 2016

den-run-ai commented Jan 16, 2017

gfyoung commented Jan 16, 2017

den-run-ai commented Jan 16, 2017 via email

gfyoung commented Jan 16, 2017

GabrielValeRios commented Aug 27, 2019

fpunny commented Feb 28, 2020

sfsinger19103 commented Feb 12, 2021

smith120bh commented Oct 7, 2021

output of `pd.show_versions()`

jorisvandenbossche commented Aug 27, 2016 •

edited

Loading