read_excel crashes python for certain files #23809

Closed
Dobatymo opened this issue Nov 20, 2018 · 7 comments · Fixed by #32548
Labels
Bug IO Excel read_excel, to_excel Segfault Non-Recoverable Error
Milestone

Comments

@Dobatymo
Contributor

Dobatymo commented Nov 20, 2018

import pandas
pandas.read_excel('pandas_crash.xlsx')

pandas_crash.xlsx crashes the Python process.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None

pandas: 0.23.4
pytest: 3.6.3
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.3
numpy: 1.15.4
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9dev0
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.10
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.2
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@Dobatymo Dobatymo changed the title .read_excel() crashes python .read_excel() crashes python for certain files Nov 20, 2018
@Dobatymo Dobatymo changed the title .read_excel() crashes python for certain files read_excel crashes python for certain files Nov 20, 2018
@TomAugspurger
Contributor

Can you reproduce the crash using just the underlying engine (openpyxl or xlrd)?

Can you post the traceback?

@TomAugspurger TomAugspurger added the Needs Info Clarification about behavior needed to assess issue label Nov 20, 2018
@Dobatymo
Contributor Author

Dobatymo commented Nov 21, 2018

There is no traceback because no exception is thrown; the process simply crashes. With plain xlrd or openpyxl, however, I can read the file.

path = "pandas_crash.xlsx"

# Read every cell with openpyxl (1-based row/column indices).
from openpyxl import load_workbook

sheet = load_workbook(path).active

for i in range(1, sheet.max_row + 1):
    for j in range(1, sheet.max_column + 1):
        print(repr(sheet.cell(row=i, column=j).value))

# Read every cell with xlrd (0-based row/column indices).
import xlrd

sheet = xlrd.open_workbook(path).sheet_by_index(0)

for i in range(sheet.nrows):
    for j in range(sheet.ncols):
        print(repr(sheet.cell(i, j).value))

Output:

'Column1'
'_xDC88_'
'Column1'
'\udc88'

The openpyxl output is not correct; it seems it cannot handle the single surrogate. The xlrd output is correct.
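
For context, the `_xDC88_` string that openpyxl returns looks like the `_xHHHH_` escape form that OOXML files use for characters that cannot be written directly into XML. The sketch below is an assumption about how the two outputs relate, not code from openpyxl or xlrd; `unescape_ooxml` is a hypothetical helper name.

```python
import re

# Matches the OOXML-style escape "_xHHHH_" (four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def unescape_ooxml(text: str) -> str:
    """Replace each _xHHHH_ escape with the code point it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


decoded = unescape_ooxml("_xDC88_")
print(repr(decoded))  # same value that xlrd reports for the cell
```

Applied to the openpyxl value, this yields the one-character string containing U+DC88, which matches what xlrd prints.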

@TomAugspurger
Contributor

TomAugspurger commented Nov 21, 2018 via email

@WillAyd
Member

WillAyd commented Nov 21, 2018

Not an expert in this domain but what encoding is supposed to be represented here? I believe modern Excel files use utf-16 encoding internally but doesn't that surrogate fall outside of the high surrogate range for that encoding?
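
A quick way to check which surrogate range a code point falls in, as a minimal sketch (the helper name `surrogate_kind` is made up for illustration). In UTF-16, U+D800 to U+DBFF are high surrogates and U+DC00 to U+DFFF are low surrogates, so U+DC88 is a low surrogate with no high surrogate in front of it.

```python
def surrogate_kind(ch: str) -> str:
    """Classify a single character as a high surrogate, low surrogate, or neither."""
    cp = ord(ch)
    if 0xD800 <= cp <= 0xDBFF:
        return "high"
    if 0xDC00 <= cp <= 0xDFFF:
        return "low"
    return "none"


lone = "\udc88"
print(surrogate_kind(lone))

# A lone surrogate cannot be encoded with the strict UTF-8 codec.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)
```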

@WillAyd WillAyd added the IO Excel read_excel, to_excel label Nov 21, 2018
@Dobatymo
Contributor Author

The string is a single surrogate, which is not valid Unicode. However, xlrd behaves correctly and passes the string through to Python unmodified. I encountered this type of problem when exporting .xlsx files from SQL Server. It's possible they contain invalid Unicode strings.

read_excel(path, engine="xlrd")

fails as well.

I can verify that the crash originates in pandas. Debugging with Visual Studio yields:

lib.cp36-win_amd64.pyd!00007ffd4eef114e()
lib.cp36-win_amd64.pyd!00007ffd4eef1400()
lib.cp36-win_amd64.pyd!00007ffd4eecf2bc()
lib.cp36-win_amd64.pyd!00007ffd4eed1428()
python36.dll!0000000064b2c902()
python36.dll!0000000064b2be83()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b512cf()
python36.dll!0000000064b5122d()
python36.dll!0000000064b511d7()
python36.dll!0000000064cba819()
python36.dll!0000000064cbafb9()
python36.dll!0000000064cba6f7()
python36.dll!0000000064c0a9f4()
python36.dll!0000000064b944f2()
python.exe!000000001c70126d()
kernel32.dll!00007ffd9a868102()
ntdll.dll!00007ffd9af9c5b4()

which is not terribly helpful, but at least we can be sure the crash is in pandas\_libs\lib.cp36-win_amd64.pyd
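
When a crash like this kills the interpreter without a traceback, one option is to run the reproducer in a child process with Python's `faulthandler` enabled, so the parent survives and any Python-level stack dump lands on stderr. This is a rough sketch, not something tried in this thread; `run_isolated` is a hypothetical helper name.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child interpreter with faulthandler enabled."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,
        timeout=timeout,
    )
```

Calling `run_isolated("import pandas; pandas.read_excel('pandas_crash.xlsx')")` and then inspecting `returncode` and `stderr` should show whether the child died abnormally. The exact return code is platform-dependent.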

@WillAyd
Member

WillAyd commented Nov 23, 2018

OK, thanks. I think we generally have a few issues with handling surrogates in the parsers (you can search the issues for similar ones). Not sure if there's a way to handle this gracefully while supporting Python 2, but we would in any case welcome investigation and PRs.

FYI, we are officially dropping Python 2 support at the start of 2019, so compatibility won't be as much of an issue soon.
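
Until the parsers handle this, a possible user-side workaround is to scrub lone surrogates from cell values read through the engine directly, before they reach pandas. This is a minimal sketch under that assumption and not how pandas itself addresses the problem; `scrub_surrogates` is a hypothetical helper name.

```python
import re

# In a Python 3 str, any code point in U+D800..U+DFFF is an unpaired surrogate.
_SURROGATES = re.compile("[\ud800-\udfff]")


def scrub_surrogates(value, replacement="\ufffd"):
    """Replace lone surrogates in strings; return non-string values unchanged."""
    if isinstance(value, str):
        return _SURROGATES.sub(replacement, value)
    return value


cleaned = scrub_surrogates("a\udc88b")
print(cleaned.encode("utf-8"))  # encodes cleanly once the surrogate is replaced
```

The default replacement is U+FFFD, the Unicode replacement character. Passing `replacement=""` drops the offending code points instead.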

@WillAyd WillAyd added Bug and removed Needs Info Clarification about behavior needed to assess issue labels Nov 23, 2018
@WillAyd WillAyd added this to the Contributions Welcome milestone Nov 23, 2018
@jbrockmendel jbrockmendel added the Segfault Non-Recoverable Error label Oct 16, 2019
roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 8, 2020
roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 8, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.1 Mar 11, 2020
@corimnally

corimnally commented Feb 19, 2023

Is this resolved? I get the same issue: the process crashes with no error thrown when calling pd.read_excel() with the path to an .xlsx file. I'm using Python 3.10 and pandas 1.5.3. It works fine locally, but fails when run by a GitHub Actions workflow.

6 participants