Memory leak on to_json? #26347

Closed
jorgecarleitao opened this issue May 11, 2019 · 2 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

@jorgecarleitao

Code Sample, a copy-pastable example if possible

import resource

import numpy as np
import pandas as pd

# some random data, stored as an array-valued column
batches = 5000
df = pd.DataFrame({
    't': [np.array(range(0, 180 * 60, 5))] * batches,
})


# convert the data above to JSON repeatedly (e.g. as the backend of a RESTful API would)
for i in range(100):
    # commenting out either of the 2 lines makes the leak vanish.
    print(df['t'].iloc[0].shape, df['t'].iloc[0].dtype)
    df['t'] = df['t'].apply(lambda x: np.array(list(x)))
    df['t'].to_json()
    # peak resident set size of this process (kilobytes on Linux, bytes on macOS)
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

Problem description

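The code above prints the following peak memory usage (ru_maxrss, which Linux reports in kilobytes). It only runs on Linux/macOS, because the resource module is Unix-only: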

254384
337780
422984
508720
593996
679336
762564
848036
933768
1019040
1104384
1187612
1273084
1358552
1444088
1529432
1612660
1698128
1783600
1869136
1954244
2037708
2122980
...

i.e. the peak memory usage grows by roughly 85,000 KB on every iteration and never levels off, which points to a memory leak.
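
To put a number on that growth, here is a small sketch that computes the per-iteration increase from the first few values printed above and compares it with the size of the array data in the frame. The int64 dtype (8 bytes per element) is an assumption; the report prints the dtype but does not show it.

```python
# Peak RSS values (ru_maxrss, kilobytes on Linux) copied from the first
# iterations of the output above.
peaks_kb = [254384, 337780, 422984, 508720, 593996, 679336]

# Growth between consecutive iterations.
deltas_kb = [b - a for a, b in zip(peaks_kb, peaks_kb[1:])]
mean_delta_kb = sum(deltas_kb) / len(deltas_kb)

# Size of the array data in the frame: 5000 rows, each an array with
# len(range(0, 180 * 60, 5)) elements. ASSUMPTION: int64, 8 bytes each.
elements_per_row = len(range(0, 180 * 60, 5))
payload_kb = 5000 * elements_per_row * 8 / 1024

print(deltas_kb)
print(round(mean_delta_kb), round(payload_kb))
```

Under that assumption the mean growth per iteration is close to the size of one full copy of the column's array data, which would be consistent with each iteration leaving one copy behind. This is only an estimate from the printed numbers, not a diagnosis.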

Expected Output

The values above should level off after the first few iterations instead of growing without bound.
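
One way to check that expectation automatically is sketched below. The helper names `peak_rss_per_iteration` and `keeps_growing` are hypothetical (not part of pandas or this report), and the warm-up length is an arbitrary choice. Like the sample above, it relies on the Unix-only `resource` module.

```python
import gc
import resource


def peak_rss_per_iteration(fn, iterations=10):
    """Call fn repeatedly and record ru_maxrss after each call."""
    peaks = []
    for _ in range(iterations):
        fn()
        gc.collect()  # rule out garbage that is merely awaiting collection
        peaks.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return peaks


def keeps_growing(peaks, warmup=3):
    """True if the peak rose on every iteration after the warm-up ones."""
    tail = peaks[warmup:]
    return len(tail) > 1 and all(b > a for a, b in zip(tail, tail[1:]))
```

For example, wrapping the two lines from the loop body in a function and passing it to `peak_rss_per_iteration` gives a list that `keeps_growing` can judge. Since ru_maxrss is a high-water mark it never decreases, so a strictly rising tail is the signal to look for.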

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1075-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.0.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.13.0
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml.etree: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.1
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jreback
Contributor

jreback commented May 11, 2019

you don’t need 2 issues

@jreback jreback closed this as completed May 11, 2019
@gfyoung gfyoung added the Duplicate Report Duplicate issue or pull request label May 12, 2019
@gfyoung gfyoung added this to the No action milestone May 12, 2019
@patrickeganfoley

I think the other issue was this one, which has been resolved.
