Pandas 1.0.1 - .rolling().min() and .rolling().max() create memory leak at <__array_function__ internals>:6 #32266

Closed
regmeg opened this issue Feb 26, 2020 · 7 comments
Labels
Duplicate Report (duplicate issue or pull request), Performance (memory or execution speed), Window (rolling, ewma, expanding)
Milestone

Comments


regmeg commented Feb 26, 2020

Code Sample, a copy-pastable example if possible

import tracemalloc, linecache
import sys, os
import pandas as pd

def display_top_mem(snapshot, key_type='lineno', limit=10):
    """function for displaying lines of code taking most memory"""
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


def main():
    tracemalloc.start()
    periods = 745
    df_init = pd.read_csv('./mem_debug_data.csv', index_col=0)

    for i in range(100):
        df = df_init.copy()

        df['l:c:B'] = df['c:B'].rolling(periods).min()
        df['h:c:B'] = df['c:B'].rolling(periods).max()

        #df['l:c:B'] = df['c:B'].rolling(periods).mean()
        #df['h:c:B'] = df['c:B'].rolling(periods).median()

        snapshot = tracemalloc.take_snapshot()
        display_top_mem(snapshot, limit=3)
        print(f'df size {sys.getsizeof(df)/1024} KiB')
        print(f'{i} ##################')


if __name__ == '__main__':
    main()

Problem description

Pandas' rolling().min() and rolling().max() functions create a memory leak. I've run line-based memory profiling with tracemalloc, and the allocation at <__array_function__ internals>:6 grows on every loop iteration of the script above whenever both of these functions are present. Over 1000 iterations it consumes around 650 MB of RAM, whereas if, for example, rolling().min() and rolling().max() are changed to rolling().mean() and rolling().median() and run for 1000 iterations, RAM consumption stays constant at around 4 MB. Therefore rolling().min() and rolling().max() appear to be the problem.

The output of this script running for 100 iterations with <__array_function__ internals>:6 constantly increasing in size can be found here: https://pastebin.com/nvGKgmPq

CSV file mem_debug_data.csv used in the script can be found here: http://www.sharecsv.com/s/ad8485d8a0a24a5e12c62957de9b13bd/mem_debug_data.csv

Expected Output

Running rolling().min() and rolling().max() constantly over time should not grow RAM consumption.
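A compact way to sanity-check this expectation (a rough sketch on synthetic data rather than the CSV above; the series size, iteration counts, and threshold are arbitrary choices for illustration):

```python
import tracemalloc

import numpy as np
import pandas as pd

s = pd.Series(np.random.rand(10_000))

tracemalloc.start()

# first batch of identical rolling work
for _ in range(50):
    s.rolling(100).min()
    s.rolling(100).max()
first, _ = tracemalloc.get_traced_memory()

# second, identical batch
for _ in range(50):
    s.rolling(100).min()
    s.rolling(100).max()
second, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

# on a non-leaking pandas, traced memory should not grow
# meaningfully between two batches of identical work
growth = second - first
print(f"growth: {growth / 1024:.1f} KiB")
```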

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-88-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


regmeg commented Feb 27, 2020

A workaround is to use numpy with the following strides-based functions. Pandas' rolling().apply() with a lambda could also be used on top of rolling, but it is very slow.

import numpy as np

def rolling_window_nan_filled(a_org, window):
    # pad the front with NaN so the output has the same length as the input
    a = np.concatenate((np.full(window - 1, np.nan), a_org))
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    # build a (len(a_org), window) view without copying the data
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def numpy_rolling_min(values, periods):
    return np.min(rolling_window_nan_filled(values, periods), axis=1)

def numpy_rolling_max(values, periods):
    return np.max(rolling_window_nan_filled(values, periods), axis=1)

numpy_rolling_min() and numpy_rolling_max() expect a numpy array of values from a pandas Series, which can be obtained with df[column].values.
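For reference, a self-contained usage sketch of the workaround above on small illustrative data (the values here are made up):

```python
import numpy as np

def rolling_window_nan_filled(a_org, window):
    # pad the front with NaN so the output matches the input length
    a = np.concatenate((np.full(window - 1, np.nan), a_org))
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    # (len(a_org), window) strided view, no data copy
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

values = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
mins = np.min(rolling_window_nan_filled(values, 3), axis=1)
maxs = np.max(rolling_window_nan_filled(values, 3), axis=1)
# the first window-1 entries are NaN, mirroring pandas' rolling() output;
# mins continues 1, 1, 1, 1, 2, 2 and maxs continues 4, 4, 5, 9, 9, 9
print(mins)
print(maxs)
```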

@jorisvandenbossche
Member

@regmeg can you check if you see the same problem with pandas 0.25, or whether it is new in 1.0?


regmeg commented Feb 29, 2020

Hi @jorisvandenbossche, thanks for your reply. I've just rerun the script with 0.25 and the memory does not accumulate, so there is no memory leak there.

The script I've submitted is copy-pastable, so it should be easy to reproduce with 1.0.1: just download the dataset and run the script. The leak occurs both on my local Linux machine and in Docker Linux/Python based images on instances.


xmatthias commented Apr 23, 2020

This is a pretty severe bug in my eyes, so I think it should get higher priority.

It's happening with the latest version of pandas, too (1.0.3 as of the time of writing, as well as on current master), if that helps.
I can also confirm that it doesn't happen with 0.25.3.

Doing some investigation: running git bisect between v0.25.3 and v1.0.0, testing the code segment above at each step, I got the following output:

6e5d14834072e7856987eb31e574b2a05db9f0b9 is the first bad commit
commit 6e5d14834072e7856987eb31e574b2a05db9f0b9
Author: Matthew Roeschke <[email protected]>
Date:   Thu Nov 21 04:59:30 2019 -0800

    REF: Separate window bounds calculation from aggregation functions (#29428)

:040000 040000 163ddd42a163c6da0f81a44827efb37bf195cefd 642367c538696b935b83329ba4656c47e838d5fc M      pandas
:100755 100755 545765ecb114d20248f81d1bdaacf6bfd3b53050 0915b6aba113a1af9976db69d791c72997feea95 M      setup.py

The last good commit seems to be a46806c, while the one introducing the problem is 6e5d148 .

Now I fail to see why it would work for mean but not for min/max, but I hope this helps someone with more knowledge of the pandas code find the problem quickly.

@xmatthias

Additional info: #33693 will fix this issue.

@jreback added the Performance and Window labels Apr 27, 2020
@jreback added this to the 1.1 milestone Apr 27, 2020
@jreback added the Duplicate Report label Apr 27, 2020

jreback commented Jul 10, 2020

Fixed by #33693 in 1.0.4, I think.

@jreback closed this as completed Jul 10, 2020

hmate9 commented Jul 13, 2020

Confirmed fixed in 1.0.4
