Skip to content

Unable to read MultiIndex columns from CSV if empty levels #13054

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jluttine opened this issue May 2, 2016 · 13 comments · Fixed by #44657
Closed

Unable to read MultiIndex columns from CSV if empty levels #13054

jluttine opened this issue May 2, 2016 · 13 comments · Fixed by #44657
Assignees
Labels
Milestone

Comments

@jluttine
Copy link

jluttine commented May 2, 2016

If I use MultiIndex columns and if a level happens to have empty values for all columns, the saved CSV file cannot be read. I expected to recover the dataframe from the saved CSV perfectly.

I believe #6618 might be related, because this is somehow related to how Pandas uses an empty data row to separate column names and actual data when using MultiIndex columns.

Code Sample, a copy-pastable example if possible

This works as expected:

In [1]: pd.DataFrame({('a','b'): [1, 2], ('c','d'): [3, 4]}).to_csv('temp.csv', index=False)

In [2]: pd.read_csv('temp.csv', header=[0,1])
Out[2]: 
   a  c
   b  d
0  1  3
1  2  4

However, if a level is empty (i.e., all columns are '' on that level), it doesn't work:

In [3]: pd.DataFrame({('a',''): [1, 2], ('c',''): [3, 4]}).to_csv('temp.csv', index=False)

In [4]: pd.read_csv('temp.csv', header=[0,1])
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-73-9f097e07e5a9> in <module>()
----> 1 pd.read_csv('temp.csv', header=[0,1])

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    527                     skip_blank_lines=skip_blank_lines)
    528 
--> 529         return _read(filepath_or_buffer, kwds)
    530 
    531     parser_f.__name__ = name

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    293 
    294     # Create the parser.
--> 295     parser = TextFileReader(filepath_or_buffer, **kwds)
    296 
    297     if (nrows is not None) and (chunksize is not None):

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    610             self.options['has_index_names'] = kwds['has_index_names']
    611 
--> 612         self._make_engine(self.engine)
    613 
    614     def _get_options_with_defaults(self, engine):

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    745     def _make_engine(self, engine='c'):
    746         if engine == 'c':
--> 747             self._engine = CParserWrapper(self.f, **self.options)
    748         else:
    749             if engine == 'python':

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1132                     self._extract_multi_indexer_columns(
   1133                         self._reader.header, self.index_names, self.col_names,
-> 1134                         passed_names
   1135                     )
   1136                 )

/usr/lib/python3.5/site-packages/pandas/io/parsers.py in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
    906                     "Passed header=[%s] are too many rows for this "
    907                     "multi_index of columns"
--> 908                     % ','.join([str(x) for x in self.header])
    909                 )
    910 

CParserError: Passed header=[0,1] are too many rows for this multi_index of columns

Expected Output

Expected that the empty columns are read correctly because I had explicitly specified the rows to use as column index:

   a  c

0  1  3
1  2  4

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.5.2-gnu-1
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_DK.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.0
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
@jreback
Copy link
Contributor

jreback commented May 2, 2016

I suppose this could be fixed, very odd case though. Using empty levels is really confusing / prob should be banned.

@jreback jreback added this to the Next Major Release milestone May 2, 2016
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 27, 2016

From a discussion with @mralgos this morning, there seems to be a specific check in the implementation about this:

pandas/pandas/io/parsers.py

Lines 1124 to 1132 in 9d10b76

# if we find 'Unnamed' all of a single level, then our header was too
# long
for n in range(len(columns[0])):
if all(['Unnamed' in tostr(c[n]) for c in columns]):
raise CParserError(
"Passed header=[%s] are too many rows for this "
"multi_index of columns"
% ','.join([str(x) for x in self.header])
)

So it seems this is done on purpose. The question is also what would be the expected output of this should not error.
Removing the above mentioned code lines, gives you this output

s = """a,b
,
1,2
3,4"""

In [4]: pd.read_csv(StringIO(s), header=[0,1])
Out[4]: 
                   a                  b
  Unnamed: 0_level_1 Unnamed: 1_level_1
0                  1                  2
1                  3                  4

While @jluttine expects a second index level with NaNs?

@jluttine
Copy link
Author

Yep, I would have expected a second index level with NaNs.

@jorisvandenbossche
Copy link
Member

cc @gfyoung @chris-b1 Any opinion on whether we should raise an error, fill empty level with 'Unnamed ...' (see #13054 (comment)), or fill empty level with NaNs?

So case is:

s = """a,b
,
1,2
3,4"""

pd.read_csv(StringIO(s), header=[0,1])

If we don't raise an error, filling with 'Unnamed ..' seems more in line with other cases where a column label is missing.

@gfyoung
Copy link
Member

gfyoung commented Sep 1, 2016

I would agree that Unnamed, while a little aesthetically unpleasing, is consistent with what we would do as you mentioned @jorisvandenbossche should we choose not to raise an error. I agree with @jreback that this is an unusual use case, but then again, I see no reason not to ban it.

@den-run-ai
Copy link

I don't like this "Unnamed..." thing in the multilevel columns, when reading from CSV. Some levels of the column are filled only partially in my case. The empty levels of some columns should probably be imported as NaNs. The reason is that it looks less ugly and can be easily replaced with empty string or whatever the user wishes.

@gfyoung
Copy link
Member

gfyoung commented Jan 16, 2017

@denfromufa : I agree that it is not aesthetically as pleasing, but IMO we should also keep in mind that clarity is of great importance. Unnamed: ... sticks out very clearly in the result to indicate that something was missing in the header provided. In addition, the naming system protects against duplicate levels, which can make indexing more difficult.

So while I do agree about the aesthetics, I would not be in favor of using NaN as placeholders. While I am not against banning the example provided above, I am inclined to stick the Unnamed paradigm (which we use in single-level columns) for the time being.

Perhaps it is best that we first settle (or regather consensus on) the initial matter: is the example in the issue allowed, or should we ban it? If we decide to ban it, your point (and all other points about the naming) are somewhat moot. However, we largely agree that the example should be allowed, then we can perhaps open the discussion up about the naming scheme.

Personally, I am okay with allowing this example through. It's unusual, but I see no reason to ban it. Thoughts, anyone?

@den-run-ai
Copy link

den-run-ai commented Jan 16, 2017 via email

@gfyoung
Copy link
Member

gfyoung commented Jan 16, 2017

@denfromufa : I hate to break it to you, but there is no such thing as a general-purpose solution. 😄 That's why I'm asking for consensus before we do anything.

@GabrielValeRios
Copy link

@denfromufa : I agree that it is not aesthetically as pleasing, but IMO we should also keep in mind that clarity is of great importance. Unnamed: ... sticks out very clearly in the result to indicate that something was missing in the header provided. In addition, the naming system protects against duplicate levels, which can make indexing more difficult.

So while I do agree about the aesthetics, I would not be in favor of using NaN as placeholders. While I am not against banning the example provided above, I am inclined to stick the Unnamed paradigm (which we use in single-level columns) for the time being.

Perhaps it is best that we first settle (or regather consensus on) the initial matter: is the example in the issue allowed, or should we ban it? If we decide to ban it, your point (and all other points about the naming) are somewhat moot. However, we largely agree that the example should be allowed, then we can perhaps open the discussion up about the naming scheme.

Personally, I am okay with allowing this example through. It's unusual, but I see no reason to ban it. Thoughts, anyone?

I'm new at contributing to pandas, and I would like to know if this issue is still relevant to solve. Also, I agree with @gfyoung that Unnamed could be a better placeholder than NaN.

@fpunny
Copy link

fpunny commented Feb 28, 2020

Is it possible for me to pick this ticket up? I haven't really worked with Pandas before and would like to do it to get familar

@sfsinger19103
Copy link

Has this been implemented yet? If so, in which pandas version? @fpunny

@smith120bh
Copy link

I'll just throw an extra opinion into here, where I'm very much in favour of the "Unnamed" approach here rather than error'ing out. My use case is that we have a standard format for some Excel files we want to let staff import, which includes three rows of headers (label, symbol, and units). However, there are cases where, for example, none of the columns should have a unit attached to them. We've told our staff that they should just always put a random space in the heading rows of their spreadsheet if they get a weird error message. While functional, adding a random space is a really silly solution.

@jreback jreback modified the milestones: Contributions Welcome, 1.4 Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.