Pandas 1.0.1 - .rolling().min() and .rolling().max() create memory leak at <__array_function__ internals>:6 #32266

Closed
regmeg opened this issue Feb 26, 2020 · 7 comments
Labels
Duplicate Report (duplicate issue or pull request), Performance (memory or execution speed), Window (rolling, ewma, expanding)
Milestone

Comments


regmeg commented Feb 26, 2020

Code Sample, a copy-pastable example if possible

import tracemalloc, linecache
import sys, os
import pandas as pd

def display_top_mem(snapshot, key_type='lineno', limit=10):
    """function for displaying lines of code taking most memory"""
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def main():
    tracemalloc.start()
    periods = 745
    df_init = pd.read_csv('./mem_debug_data.csv', index_col=0)

    for i in range(100):
        df = df_init.copy()

        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()

        #df['l:c:B'] = df['c:B'].rolling(periods).mean()
        #df['h:c:B'] = df['c:B'].rolling(periods).median()

        snapshot = tracemalloc.take_snapshot()
        display_top_mem(snapshot, limit=3)
        print(f'df size {sys.getsizeof(df)/1024} KiB')
        print(f'{i} ##################')


if __name__ == '__main__':
    main()

Problem description

Pandas' rolling().min() and rolling().max() functions create a memory leak. I've run line-based memory profiling with tracemalloc, and the allocation at <__array_function__ internals>:6 grows on every loop iteration of the script above whenever both of these functions are present. Over 1000 iterations it consumes around 650 MB of RAM, whereas if, for example, rolling().min() and rolling().max() are changed to rolling().mean() and rolling().median() and run for 1000 iterations, RAM consumption stays constant at around 4 MB. Therefore rolling().min() and rolling().max() appear to be the problem.

The output of this script running for 100 iterations with <__array_function__ internals>:6 constantly increasing in size can be found here: https://pastebin.com/nvGKgmPq

CSV file mem_debug_data.csv used in the script can be found here: http://www.sharecsv.com/s/ad8485d8a0a24a5e12c62957de9b13bd/mem_debug_data.csv

Expected Output

Running rolling().min() and rolling().max() constantly over time should not grow RAM consumption.
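A compact way to sanity-check this expectation (a rough sketch on synthetic data rather than the CSV above; the series size, iteration counts, and threshold are arbitrary choices for illustration):

```python
import tracemalloc

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10_000))

tracemalloc.start()

# first batch of identical rolling work
for _ in range(50):
    s.rolling(100).min()
    s.rolling(100).max()
first, _ = tracemalloc.get_traced_memory()

# second, identical batch
for _ in range(50):
    s.rolling(100).min()
    s.rolling(100).max()
second, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# on a non-leaking pandas, traced memory should not grow
# meaningfully between two batches of identical work
growth = second - first
print(f"growth: {growth / 1024:.1f} KiB")
```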

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-88-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


regmeg commented Feb 27, 2020

A workaround is to use numpy with the following strides-based functions. Pandas' rolling().apply() with a lambda could also be used on top of rolling, but it is very slow.

import numpy as np

def rolling_window_nan_filled(a_org, window):
    # pad the front with NaN so the output has the same length as the input
    a = np.concatenate((np.full(window - 1, np.nan), a_org))
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    # build a (len(a_org), window) view without copying the data
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def numpy_rolling_min(values, periods):
    return np.min(rolling_window_nan_filled(values, periods), axis=1)

def numpy_rolling_max(values, periods):
    return np.max(rolling_window_nan_filled(values, periods), axis=1)

numpy_rolling_min() and numpy_rolling_max() expect a numpy array of values from a pandas Series, which can be obtained with df[column].values.
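For reference, a self-contained usage sketch of the workaround above on small illustrative data (the values here are made up):

```python
import numpy as np

def rolling_window_nan_filled(a_org, window):
    # pad the front with NaN so the output matches the input length
    a = np.concatenate((np.full(window - 1, np.nan), a_org))
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    # (len(a_org), window) strided view, no data copy
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

values = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
mins = np.min(rolling_window_nan_filled(values, 3), axis=1)
maxs = np.max(rolling_window_nan_filled(values, 3), axis=1)
# the first window-1 entries are NaN, mirroring pandas' rolling() output;
# mins continues 1, 1, 1, 1, 2, 2 and maxs continues 4, 4, 5, 9, 9, 9
print(mins)
print(maxs)
```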

@jorisvandenbossche
Member

@regmeg can you check if you see the same problem with pandas 0.25, or whether it is new in 1.0?


regmeg commented Feb 29, 2020

Hi @jorisvandenbossche, thanks for your reply. I've just rerun the script with 0.25 and the memory does not accumulate, so there is no memory leak there.

The script I've submitted is copy-pastable, so it should be easy to reproduce with 1.0.1: just download the dataset and run the script. The leak occurs both on my local Linux machine and in Docker Linux/Python based images on instances.


xmatthias commented Apr 23, 2020

This is a pretty severe bug in my eyes, so I think it should get higher priority.

It's happening with the latest version of pandas, too (1.0.3 as of the time of writing, as well as on current master), if that helps.
I can also confirm that it doesn't happen with 0.25.3.

Doing some investigation: running git bisect between v0.25.3 and v1.0.0, testing the code segment above at each step, I got the following output:

6e5d14834072e7856987eb31e574b2a05db9f0b9 is the first bad commit
commit 6e5d14834072e7856987eb31e574b2a05db9f0b9
Author: Matthew Roeschke <[email protected]>
Date:   Thu Nov 21 04:59:30 2019 -0800

    REF: Separate window bounds calculation from aggregation functions (#29428)

:040000 040000 163ddd42a163c6da0f81a44827efb37bf195cefd 642367c538696b935b83329ba4656c47e838d5fc M      pandas
:100755 100755 545765ecb114d20248f81d1bdaacf6bfd3b53050 0915b6aba113a1af9976db69d791c72997feea95 M      setup.py

The last good commit seems to be a46806c, while the one introducing the problem is 6e5d148 .

Now I fail to see why it would work for mean but not for min/max, but I hope this helps someone with more knowledge of the pandas code find the problem quickly.

@xmatthias

Additional info: #33693 will fix this issue.

@jreback added the Performance and Window labels Apr 27, 2020
@jreback added this to the 1.1 milestone Apr 27, 2020
@jreback added the Duplicate Report label Apr 27, 2020

jreback commented Jul 10, 2020

Fixed by #33693 in 1.0.4, I think.

@jreback closed this as completed Jul 10, 2020

hmate9 commented Jul 13, 2020

Confirmed fixed in 1.0.4
