read_excel crashes python for certain files #23809

Dobatymo · 2018-11-20T07:30:38Z

import pandas
pandas.read_excel('pandas_crash.xlsx')

pandas_crash.xlsx crashes the Python process.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None

pandas: 0.23.4
pytest: 3.6.3
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.3
numpy: 1.15.4
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9dev0
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.10
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.2
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-11-20T12:19:50Z

Can you reproduce the crash using just the underlying engine (openpyxl or xlrd)?

Can you post the traceback?

Dobatymo · 2018-11-21T02:04:37Z

There is no traceback because no exception is thrown; the process simply crashes. With plain xlrd or openpyxl, however, I can read the file.

path = "pandas_crash.xlsx"

# Read every cell with openpyxl (rows and columns are 1-based)
from openpyxl import load_workbook

sheet = load_workbook(path).active

for i in range(1, sheet.max_row + 1):
    for j in range(1, sheet.max_column + 1):
        print(repr(sheet.cell(row=i, column=j).value))

# Read every cell with xlrd (rows and columns are 0-based)
import xlrd

sheet = xlrd.open_workbook(path).sheet_by_index(0)

for i in range(sheet.nrows):
    for j in range(sheet.ncols):
        print(repr(sheet.cell(i, j).value))

Output:

'Column1'
'_xDC88_'
'Column1'
'\udc88'

The openpyxl output is not correct; it seems it cannot handle the single surrogate. The xlrd output is correct.
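For context, the `_xDC88_` value printed by openpyxl above looks like the `_xHHHH_` escape form that the OOXML format uses for characters that cannot appear directly in XML. The following is only a minimal sketch of how such an escape could be decoded by hand; `decode_ooxml_escapes` is a hypothetical helper written for illustration and is not part of openpyxl, xlrd, or pandas.

```python
import re

# Matches the _xHHHH_ escape form (four hex digits).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def decode_ooxml_escapes(text: str) -> str:
    """Replace each _xHHHH_ escape with the code point it names.

    Hypothetical helper for illustration only.
    """
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


decoded = decode_ooxml_escapes("_xDC88_")
print(repr(decoded))
```

Decoding the escape this way yields the same lone surrogate string that xlrd returns, which suggests the two engines are reading the same underlying cell value.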

TomAugspurger · 2018-11-21T12:06:10Z

> There is no traceback as no exception is thrown. The process simply crashes.

Strange.

> xlrd is correct.

Does pd.read_excel with `engine='xlrd'` work?


WillAyd · 2018-11-21T14:15:50Z

Not an expert in this domain, but what encoding is supposed to be represented here? I believe modern Excel files use UTF-16 encoding internally, but doesn't that surrogate fall outside of the high surrogate range for that encoding?

Dobatymo · 2018-11-22T01:30:39Z

The string is a single surrogate, which is not valid Unicode. However, xlrd behaves correctly and passes the string through to Python unmodified. I encountered this type of problem when exporting .xlsx files from SQL Server. It's possible they contain invalid Unicode strings.

read_excel(path, engine="xlrd")

fails as well.
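To make the "not valid Unicode" point concrete, here is a small sketch in plain Python (no pandas involved) showing that a lone surrogate such as `'\udc88'` can live in a `str` but cannot be encoded to UTF-8 with the default strict error handler. `utf8_or_none` is a hypothetical helper name used only for this illustration; whether this is the exact failure path inside pandas is an assumption, not something this snippet proves.

```python
s = "\udc88"  # a lone low surrogate, as returned by xlrd for this cell


def utf8_or_none(text):
    """Return the strict UTF-8 encoding of text, or None if it cannot be encoded.

    Hypothetical helper for illustration only.
    """
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


strict = utf8_or_none(s)  # strict encoding rejects the lone surrogate
lenient = s.encode("utf-8", errors="surrogatepass")  # opt-in escape hatch

print(strict)
print(lenient)
```

Code that assumes every Python string can be converted to UTF-8 therefore has to check for a failed conversion before using the result.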

I can verify that the source of the crash is in pandas. Debugging with Visual Studio yields:

lib.cp36-win_amd64.pyd!00007ffd4eef114e()
lib.cp36-win_amd64.pyd!00007ffd4eef1400()
lib.cp36-win_amd64.pyd!00007ffd4eecf2bc()
lib.cp36-win_amd64.pyd!00007ffd4eed1428()
python36.dll!0000000064b2c902()
python36.dll!0000000064b2be83()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3ca49()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2b940()
python36.dll!0000000064b2b725()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2485d()
python36.dll!0000000064b3f29f()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b2c2fa()
python36.dll!0000000064b3e34e()
python36.dll!0000000064b2cbf4()
python36.dll!0000000064b512cf()
python36.dll!0000000064b5122d()
python36.dll!0000000064b511d7()
python36.dll!0000000064cba819()
python36.dll!0000000064cbafb9()
python36.dll!0000000064cba6f7()
python36.dll!0000000064c0a9f4()
python36.dll!0000000064b944f2()
python.exe!000000001c70126d()
kernel32.dll!00007ffd9a868102()
ntdll.dll!00007ffd9af9c5b4()

which is not terribly helpful, but at least we can be sure the crash is in pandas\_libs\lib.cp36-win_amd64.pyd
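When a process dies without a Python exception, the standard library's `faulthandler` module can sometimes dump the Python-level stack at the moment of the crash. Below is a minimal sketch, assuming it is acceptable to run the failing call in a child interpreter; `run_isolated` is a hypothetical helper name, and the snippet passed to it is only an example.

```python
import subprocess
import sys


def run_isolated(code, timeout=60.0):
    """Run `code` in a child interpreter with faulthandler enabled.

    Returns (returncode, stdout, stderr). If the child crashes hard,
    stderr may contain the Python stack that faulthandler dumped.
    Hypothetical helper for illustration only.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


# Harmless demonstration; for the real case the code string would be
# something like: "import pandas; pandas.read_excel('pandas_crash.xlsx')"
rc, out, err = run_isolated("print('ok')")
print(rc, out.strip())
```

Running the crash in a child process also keeps the parent interpreter alive, so the return code and any dumped stack can be inspected afterwards.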

WillAyd · 2018-11-23T03:23:48Z

OK, thanks. I think we generally have a few issues with handling surrogates in the parsers (you can search the issues for similar ones). Not sure if there's a way to handle this gracefully with Python 2 support, but we would in any case welcome investigation and PRs.

FYI, we are officially dropping Python 2 support at the start of 2019, so compatibility won't be as much of an issue soon.


corimnally · 2023-02-19T05:21:41Z

Is this resolved? I get the same issue, where it crashes with no error thrown when calling pd.read_excel() with the path to an .xlsx file. I'm using Python 3.10 and pandas 1.5.3. It works fine locally, but fails when it is run by a GitHub Actions workflow.

Dobatymo changed the title ~~.read_excel() crashes python~~ .read_excel() crashes python for certain files Nov 20, 2018

Dobatymo changed the title ~~.read_excel() crashes python for certain files~~ read_excel crashes python for certain files Nov 20, 2018

TomAugspurger added the Needs Info Clarification about behavior needed to assess issue label Nov 20, 2018

WillAyd added the IO Excel read_excel, to_excel label Nov 21, 2018

WillAyd added Bug and removed Needs Info Clarification about behavior needed to assess issue labels Nov 23, 2018

WillAyd added this to the Contributions Welcome milestone Nov 23, 2018

jbrockmendel added the Segfault Non-Recoverable Error label Oct 16, 2019

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 8, 2020

Add extra check for failing UTF-8 conversion

ed4569e

fixes pandas-dev#23809 Unit test added

roberthdevries mentioned this issue Mar 8, 2020

BUG: Add extra check for failing UTF-8 conversion #32548

Merged

5 tasks

roberthdevries added a commit to roberthdevries/pandas that referenced this issue Mar 8, 2020

Add extra check for failing UTF-8 conversion

b6bb3f9

fixes pandas-dev#23809 Unit test added

jreback modified the milestones: Contributions Welcome, 1.1 Mar 11, 2020

WillAyd closed this as completed in #32548 Mar 12, 2020
