Skip to content

BUG: read_excel doesn't honor dtype for index #35816

Open
@akaihola

Description

@akaihola
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Make sure you have Pandas and xlrd installed in a virtualenv, and launch the Python interpreter. You'll need openpyxl only if you decide to create the Excel file from Python (see below).

python3 -m venv /tmp/pandas-issue-35816
. /tmp/pandas-issue-35816/bin/activate
pip install pandas xlrd openpyxl
python

To get the Excel files for reproducing the bug, you can either

  1. download index_dtype.xls (note: attachment gzipped) or index_dtype.xlsx, or
  2. copy-paste the text inside Details below into a index_dtype.csv file, import that into Excel, and save it in Excel format as index_dtype.xlsx, or
    string,value for string index
    42,value for int index
  3. create an Excel file from Python by copy-pasting from Details below:
    df = pd.DataFrame(
        {1: ["value for string index", "value for int index"]}, ["string", 42]
    )
    df.to_excel("index_dtype.xlsx", header=None)

Finally, copy-paste the following code into Python:

import pandas as pd
pd.read_excel("index_dtype.xlsx", header=None, index_col=0, dtype=str).iloc[:, 0].to_dict()              

Output:

{'string': 'value for string index', 42: 'value for int index'}

Problem description

The integer value in the index column appears as an int instead of a str in the index of the DataFrame read from an Excel file, even though dtype=str was specified.

I verified this behavior with spreadsheets

  • exported from LibreOffice 6.2.8.2 as Excel 2007-2019 (.xlsx),
  • exported from LibreOffice 6.2.8.2 as Excel 97-2003 (.xls), and
  • generated like shown in 3. create an Excel file from Python above.

Unfortunately I don't have Excel to check if "genuine" Excel spreadsheets cause this as well.

The index dtype is correct if:

  • read_csv() is used, or
  • index_col=0 is omitted (values in the first column all become str)

The index dtype is still wrong even if dtype={0: str, 1: str} is used to specify the dtype for each column separately.

Expected Output

{'string': 'value for string index', '42': 'value for int index'}

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.11-100.fc29.x86_64
Version : #1 SMP Tue Nov 12 20:41:25 UTC 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

Metadata

Metadata

Assignees

Labels

BugDtype ConversionsUnexpected or buggy dtype conversionsIO Excelread_excel, to_excelStringsString extension data type and string data

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions