read_excel crashes python for certain files #23809
Can you reproduce the crash using just the underlying engine (openpyxl or xlrd)? Can you post the traceback? |
There is no traceback as no exception is thrown. The process simply crashes. With plain xlrd or openpyxl I can read the file, however.
path = "pandas_crash.xlsx"

# Read every cell with openpyxl
from openpyxl import load_workbook
sheet = load_workbook(path).active
for i in range(1, sheet.max_row + 1):
    for j in range(1, sheet.max_column + 1):
        print(repr(sheet.cell(row=i, column=j).value))

# Read every cell with xlrd
import xlrd
sheet = xlrd.open_workbook(path).sheet_by_index(0)
for i in range(0, sheet.nrows):
    for j in range(0, sheet.ncols):
        print(repr(sheet.cell(i, j).value))
Output:
'Column1'
'_xDC88_'
'Column1'
'\udc88'
The output of openpyxl is not correct; it seems it cannot handle single surrogates. xlrd is correct. |
Strange. Does pd.read_excel with `engine='xlrd'` work?
On Tue, Nov 20, 2018 at 8:04 PM Dobatymo wrote:
There is no traceback as no exception is thrown. The process simply crashes.
|
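The `'_xDC88_'` value that openpyxl printed above looks like the `_xHHHH_` escape form used in Office Open XML strings, where `HHHH` is a four-digit hexadecimal code unit. A minimal sketch of decoding that form by hand is shown here; `unescape_ooxml` is a hypothetical helper written for illustration, not an openpyxl or pandas function, and it assumes the cell text contains no literal underscores that were themselves escaped.

```python
import re

# Matches the `_xHHHH_` escape form (four hex digits between `_x` and `_`).
_ESCAPE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def unescape_ooxml(text: str) -> str:
    """Replace each `_xHHHH_` escape with the code unit it names (hypothetical helper)."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


decoded = unescape_ooxml("_xDC88_")
print(repr(decoded))  # repr() escapes the lone surrogate, so printing is safe
```

Under that assumption, the decoded value is the same single surrogate that xlrd returned.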
Not an expert in this domain but what encoding is supposed to be represented here? I believe modern Excel files use utf-16 encoding internally but doesn't that surrogate fall outside of the high surrogate range for that encoding? |
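For reference on the range question: UTF-16 high (leading) surrogates occupy U+D800 to U+DBFF and low (trailing) surrogates occupy U+DC00 to U+DFFF. The small check below, with the hypothetical helper name `surrogate_kind`, classifies the code point from this thread.

```python
def surrogate_kind(ch: str) -> str:
    """Classify a single character as a high surrogate, low surrogate, or neither."""
    cp = ord(ch)
    if 0xD800 <= cp <= 0xDBFF:
        return "high"
    if 0xDC00 <= cp <= 0xDFFF:
        return "low"
    return "none"


print(surrogate_kind("\udc88"))
```

U+DC88 falls in the low range, so on its own it is an unpaired trailing surrogate.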
The string is a single surrogate, which is not valid Unicode. However xlrd behaves correctly and passes the string through to python unmodified. I encountered this type of problem when exporting
fails as well. I can verify that the source of the crash is from Pandas. Debugging with Visual Studio yields:
which is not terribly helpful, but at least we can be sure the crash is in |
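To illustrate why a single surrogate is awkward for code that assumes valid Unicode, the sketch below shows Python's own behavior: strict UTF-8 encoding rejects the string, while the `surrogatepass` error handler lets it through. This only demonstrates the string's properties in pure Python; it is not a claim about where or why pandas crashes.

```python
lone = "\udc88"  # the single surrogate read from the file

# Strict UTF-8 encoding refuses lone surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" handler encodes the surrogate as three bytes and can decode it back.
raw = lone.encode("utf-8", errors="surrogatepass")
round_trip = raw.decode("utf-8", errors="surrogatepass")

print(strict_ok, raw, round_trip == lone)
```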
OK thanks. I think generally we have a few issues with handling surrogates in the parsers (you can search the issues for similar ones). Not sure if there's a way to handle this gracefully with Python 2 support, but we would in any case welcome investigation and PRs. FYI, we are officially dropping Python 2 support at the start of 2019, so compatibility won't be as much of an issue soon |
fixes pandas-dev#23809 Unit test added
Is this resolved? I get the same issue where it crashes with no error thrown when calling pd.read_excel() with the path to an .xlsx file. I'm using python 3.10 and pandas 1.5.3. It works fine locally, but fails when it is run by a GitHub actions workflow. |
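When a process dies without raising an exception, one way to gather evidence is to run the failing call in a child interpreter with `faulthandler` enabled and inspect its exit code and stderr. The sketch below is one possible approach, not a fix; `run_isolated` is a hypothetical helper, and the commented-out `read_excel` call with its file path is a placeholder for the real reproduction.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run `snippet` in a fresh interpreter so a hard crash cannot take down the caller."""
    return subprocess.run(
        # -X faulthandler dumps a Python traceback to stderr on fatal signals.
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Hypothetical usage with the real reproduction (path is a placeholder):
# result = run_isolated("import pandas as pd; pd.read_excel('pandas_crash.xlsx')")

# Stand-in child that exits abnormally, to show what the parent can observe.
result = run_isolated("import os; os._exit(3)")
print(result.returncode)
print(result.stderr)
```

A nonzero return code together with whatever `faulthandler` wrote to stderr gives something concrete to attach to a report, even in a CI run where no debugger is available.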
pandas_crash.xlsx crashes the Python process.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
LOCALE: None.None
pandas: 0.23.4
pytest: 3.6.3
pip: 18.1
setuptools: 39.1.0
Cython: 0.28.3
numpy: 1.15.4
scipy: 1.0.1
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9dev0
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.10
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.2
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None