Segmentation fault when trying to load large json file #19194

Closed
levon003 opened this issue Jan 11, 2018 · 2 comments
Labels
IO JSON read_json, to_json, json_normalize

Comments

@levon003

Code Sample

import pandas as pd
site_filename = "/path/to/file.json"
df = pd.read_json(site_filename, lines=True)

Output: Segmentation fault

Problem description

Similar issue to #11344, with a 1.2G file (reported size 1216272, which is more plausibly kilobytes than bytes) with 70+ keys across the JSON records and multiple nested keys.

GDB output:

gdb python
(gdb) run produce_site_report.py 
Starting program: /path/to/anaconda3/bin/python produce_site_report.py
[Thread debugging using libthread_db enabled]
Missing separate debuginfo for /path/to/anaconda3/lib/python3.6/site-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0
[New Thread 0x7fffeda67700 (LWP 10073)]
[New Thread 0x7fffed066700 (LWP 10074)]
[New Thread 0x7fffea665700 (LWP 10075)]
[New Thread 0x7fffe7c64700 (LWP 10076)]
[New Thread 0x7fffe5263700 (LWP 10077)]
[New Thread 0x7fffe2862700 (LWP 10078)]
[New Thread 0x7fffdfe61700 (LWP 10079)]
[New Thread 0x7fffdd460700 (LWP 10080)]
[New Thread 0x7fffdaa5f700 (LWP 10081)]
[New Thread 0x7fffd805e700 (LWP 10082)]
[New Thread 0x7fffd565d700 (LWP 10083)]
[New Thread 0x7fffd2c5c700 (LWP 10084)]
[New Thread 0x7fffd025b700 (LWP 10085)]
[New Thread 0x7fffcd85a700 (LWP 10086)]
[New Thread 0x7fffcae59700 (LWP 10087)]
[New Thread 0x7fffc8458700 (LWP 10088)]
[New Thread 0x7fffc5a57700 (LWP 10089)]
[New Thread 0x7fffc3056700 (LWP 10090)]
[New Thread 0x7fffc0655700 (LWP 10091)]
[New Thread 0x7fffbdc54700 (LWP 10092)]
[New Thread 0x7fffbb253700 (LWP 10093)]
[New Thread 0x7fffb8852700 (LWP 10094)]
[New Thread 0x7fffb5e51700 (LWP 10095)]
Program received signal SIGSEGV, Segmentation fault.
JSON_DecodeObject (dec=0x7fffffffc870, 
    buffer=0x7ffe55989030 "[{\"lastName\":\"FbUPTgGSEbSiFro\",\"num\":1,\"description\":\"FbUPTgGSEbSiFro5ci23kd"..., 
    buffer=1130799709) at pandas/_libs/src/ujson/lib/ultrajsondec.c:1111
1111	pandas/_libs/src/ujson/lib/ultrajsondec.c: No such file or directory.
	in pandas/_libs/src/ujson/lib/ultrajsondec.c
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.el6_9.2.x86_64
(gdb) backtrace
#0  JSON_DecodeObject (dec=0x7fffffffc870, 
    buffer=0x7ffe55989030 "[{\"lastName\":\"FbUPTgGSEbSiFro\",\"num\":1,\"description\":\"FbUPTgGSEbSiFro5ci23kd"..., 
    buffer=1130799709) at pandas/_libs/src/ujson/lib/ultrajsondec.c:1111
#1  0x00007fffa683fd46 in JSONToObj (self=Unhandled dwarf expression opcode 0xf3
)
    at pandas/_libs/src/ujson/python/JSONtoObj.c:562
#2  0x00007ffff7bb6364 in _PyCFunction_FastCallDict ()
#3  0x00007ffff7be4f30 in _PyCFunction_FastCallKeywords ()
#4  0x00007ffff7c48ebc in call_function ()
#5  0x00007ffff7c6b3e7 in _PyEval_EvalFrameDefault ()
#6  0x00007ffff7c42b8b in fast_function ()
#7  0x00007ffff7c48f95 in call_function ()
#8  0x00007ffff7c6a62a in _PyEval_EvalFrameDefault ()
#9  0x00007ffff7c42b8b in fast_function ()
#10 0x00007ffff7c48f95 in call_function ()
#11 0x00007ffff7c6a62a in _PyEval_EvalFrameDefault ()
#12 0x00007ffff7c42b8b in fast_function ()
#13 0x00007ffff7c48f95 in call_function ()
#14 0x00007ffff7c6a62a in _PyEval_EvalFrameDefault ()
#15 0x00007ffff7c42b8b in fast_function ()
#16 0x00007ffff7c48f95 in call_function ()
#17 0x00007ffff7c6a62a in _PyEval_EvalFrameDefault ()
#18 0x00007ffff7c41f24 in _PyEval_EvalCodeWithName ()
#19 0x00007ffff7c42dc1 in fast_function ()
#20 0x00007ffff7c48f95 in call_function ()
#21 0x00007ffff7c6b3e7 in _PyEval_EvalFrameDefault ()
#22 0x00007ffff7c42b8b in fast_function ()
#23 0x00007ffff7c48f95 in call_function ()
#24 0x00007ffff7c6a62a in _PyEval_EvalFrameDefault ()
#25 0x00007ffff7c438d9 in PyEval_EvalCodeEx ()
#26 0x00007ffff7c4467c in PyEval_EvalCode ()
#27 0x00007ffff7cbece4 in run_mod ()
#28 0x00007ffff7cbf0e1 in PyRun_FileExFlags ()
#29 0x00007ffff7cbf2e4 in PyRun_SimpleFileExFlags ()
#30 0x00007ffff7cc2daf in Py_Main ()
#31 0x00007ffff7b898be in main ()

Bizarrely, the seg fault doesn't occur in the IPython REPL; e.g., executing the script with %load produce_site_report.py works fine.
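
The GDB backtrace above only names C frames, so it doesn't show which Python call led into the decoder. A minimal sketch, assuming the standard-library faulthandler module (available in Python 3.3+), that dumps the Python-level traceback when the process receives SIGSEGV. The helper name and the log file name are hypothetical, chosen only for illustration:

```python
import faulthandler

# Keep a module-level reference so the log file stays open for the
# lifetime of the process; faulthandler writes to its file descriptor.
_crash_log = None


def enable_crash_traceback(log_path="crash_traceback.log"):
    """Write the Python traceback of all threads to log_path on a fatal signal."""
    global _crash_log
    _crash_log = open(log_path, "w")
    faulthandler.enable(file=_crash_log, all_threads=True)
    return _crash_log
```

Calling enable_crash_traceback() at the top of produce_site_report.py, before the read_json call, should leave the Python frames in the log file if the crash recurs. Running the script with python -X faulthandler produce_site_report.py is an equivalent option that writes to stderr instead.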

Expected Output

No Segmentation fault.
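
Until the cause is known, one possible workaround is to avoid the ujson decoder named in the backtrace and decode the file one line at a time with the standard-library json module. This is a sketch, not a drop-in replacement for read_json: the function name is hypothetical, and it assumes the file holds exactly one JSON object per line.

```python
import json


def iter_json_lines(path, encoding="utf-8"):
    """Yield one decoded record per non-empty line of a line-delimited JSON file."""
    with open(path, encoding=encoding) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line failed so a bad record can be located.
                raise ValueError(f"line {line_number}: {exc}") from exc
```

The records could then be passed to pd.DataFrame(list(iter_json_lines(site_filename))). The resulting dtypes may differ from what read_json infers, and this path is likely to be slower, so it is a fallback rather than a fix.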

Output of pd.show_versions()

>>> pandas.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-696.18.7.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.0
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Data File

The data file in question cannot be shared. When loaded successfully in the REPL, the resulting dataframe's df.dtypes reports: Length: 72, dtype: object. Full dtype list:

int64 object object object object float64 object float64 object object object float64 float64 object object object object float64 float64 object object float64 float64 float64 object object float64 float64 float64 float64 object object object float64 object object object object float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 object object float64 object object object object object object float64 float64 float64 float64 float64 object object object object object object object object float64
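
Since the real file can't be shared, a synthetic file with a similar shape might help someone attempt a reproduction. The sketch below writes line-delimited JSON with one integer key, 32 float keys, and 39 string or nested-object keys, chosen to roughly mirror the dtype list above. All key names, value distributions, and the nesting pattern are assumptions; nothing here is known to trigger the crash.

```python
import json
import random
import string


def random_text(rng, length=15):
    """Return a random ASCII-letter string of the given length."""
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))


def make_record(rng, index, n_float=32, n_object=39):
    """Build one record: an int key, float keys, and text or nested keys."""
    record = {"num": index}
    for i in range(n_float):
        record[f"float_{i}"] = rng.random()
    for i in range(n_object):
        if i % 5 == 0:
            # Every fifth object key holds a nested object (assumed pattern).
            record[f"obj_{i}"] = {"name": random_text(rng), "value": rng.random()}
        else:
            record[f"obj_{i}"] = random_text(rng)
    return record


def write_synthetic_json_lines(path, n_records, seed=0):
    """Write n_records records to path, one JSON object per line."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as handle:
        for index in range(n_records):
            handle.write(json.dumps(make_record(rng, index)) + "\n")
```

Something like write_synthetic_json_lines("synthetic.json", 1000000) followed by pd.read_json("synthetic.json", lines=True) in a plain script would be the experiment; the record count needed to reach a file of comparable size is a guess and would have to be tuned.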
@jreback
Contributor

jreback commented Jan 12, 2018

@levon003 without a reproducible example it will not be possible to help w/o guessing. so pls post a minimal example that shows the issue.

jreback added the Can't Repro and IO JSON (read_json, to_json, json_normalize) labels on Jan 12, 2018
@levon003
Author

Hey @jreback, I understand this issue isn't much use without a reproducible example. Unfortunately, I can't provide one: I don't know which features of the file cause the crash, and I would only be guessing myself. I figured it was still worth sharing in order to (1) document that at least some valid JSON files cause a seg fault in the current implementation, (2) note that the behavior differs between the REPL and the plain Python interpreter, and (3) provide a GDB stack trace to assist debugging if a similar issue is raised in the future. If I'm wrong on those counts, feel free to delete the issue wholesale.

For now, I'm closing the issue, thanks.
