Description
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
Make sure you have Pandas and xlrd
installed in a virtualenv, and launch the Python interpreter. You'll need openpyxl
only if you decide to create the Excel file from Python (see below).
python3 -m venv /tmp/pandas-issue-35816
. /tmp/pandas-issue-35816/bin/activate
pip install pandas xlrd openpyxl
python
To get the Excel files for reproducing the bug, you can either
- download index_dtype.xls (note: attachment gzipped) or index_dtype.xlsx, or
- copy-paste the text inside
Details
below into aindex_dtype.csv
file, import that into Excel, and save it in Excel format asindex_dtype.xlsx
, orstring,value for string index 42,value for int index
- create an Excel file from Python by copy-pasting from
Details
below:df = pd.DataFrame( {1: ["value for string index", "value for int index"]}, ["string", 42] ) df.to_excel("index_dtype.xlsx", header=None)
Finally, copy-paste the following code into Python:
import pandas as pd
pd.read_excel("index_dtype.xlsx", header=None, index_col=0, dtype=str).iloc[:, 0].to_dict()
Output:
{'string': 'value for string index', 42: 'value for int index'}
Problem description
The integer value in the index column appears as an int
instead of a str
in the index of the DataFrame
read from an Excel file, even though dtype=str
was specified.
I verified this behavior with spreadsheets
- exported from LibreOffice 6.2.8.2 as
Excel 2007-2019 (.xlsx)
, - exported from LibreOffice 6.2.8.2 as
Excel 97-2003 (.xls)
, and - generated like shown in
3. create an Excel file from Python
above.
Unfortunately I don't have Excel to check if "genuine" Excel spreadsheets cause this as well.
The index dtype is correct if:
read_csv()
is used, orindex_col=0
is omitted (values in the first column all becomestr
)
The index dtype is still wrong even if dtype={0: str, 1: str}
is used to specify the dtype for each column separately.
Expected Output
{'string': 'value for string index', '42': 'value for int index'}
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.11-100.fc29.x86_64
Version : #1 SMP Tue Nov 12 20:41:25 UTC 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None