Managing pandas's hash table footprint #4491

Closed
wesm opened this issue Aug 7, 2013 · 4 comments

Labels: Enhancement, Performance

Comments

wesm commented Aug 7, 2013

Many operations in pandas trigger the population of a hash table from the underlying index values. This can be very costly in memory usage, especially for very long Series / DataFrames.

Example:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10000000))
s[100]

Note that the first time you index into s like this, a hash table in s.index._engine is populated, so something larger than 80MB of space probably gets used up by this.
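For illustration, a rough way to watch resident memory grow around that first lookup (a sketch only: it assumes psutil is installed, and the actual growth depends on the pandas version and index type):

import os
import numpy as np
import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    return proc.memory_info().rss / 1e6   # resident set size, in MB

s = pd.Series(np.random.randn(10000000))
before = rss_mb()
s[100]            # first label lookup populates the index engine's hash table
after = rss_mb()
print("RSS grew by roughly %.0f MB" % (after - before))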

We can have a whole discussion about the hash table infra and how it could be better than it is now (I may spend some time on this myself soon). For the time being, one solution would be to have a weakref dictionary someplace (or something of that nature) that enables us to globally destroy all hash tables.
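For concreteness, a minimal sketch of what that could look like -- the class and function names here are hypothetical stand-ins, not existing pandas internals:

import weakref

_engine_registry = weakref.WeakSet()   # tracks live engines without keeping them alive

class HashTableEngine(object):
    """Stand-in for an index engine that lazily builds a hash table."""
    def __init__(self, values):
        self.values = values
        self.mapping = None            # the (potentially large) hash table
        _engine_registry.add(self)

    def ensure_mapping(self):
        if self.mapping is None:
            self.mapping = dict((v, i) for i, v in enumerate(self.values))
        return self.mapping

    def clear_mapping(self):
        self.mapping = None            # drop the hash table, keep the engine usable

def destroy_all_hash_tables():
    """Globally drop every live engine's populated hash table."""
    for engine in list(_engine_registry):
        engine.clear_mapping()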

Separately, the hash table scheme can be modified to have a much smaller memory footprint than it does now -- keys can be integers stored as uint32_t, resulting in roughly 4 bytes per index value. Right now there are two arrays: one for hash keys, another for hash values.
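Back-of-the-envelope arithmetic on that proposal for the 10 million entry example above (my numbers; khash load factor and bookkeeping overhead are ignored):

n = 10000000
current = n * (8 + 8)   # two arrays today: 8-byte hash keys + 8-byte hash values
proposed = n * 4        # keys stored as uint32_t offsets, ~4 bytes per index value
print(current / 1e6, "MB now vs", proposed / 1e6, "MB proposed")   # 160.0 vs 40.0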

Motivated by http://stackoverflow.com/questions/18070520/pandas-memory-usage-when-reindexing


wesm commented Aug 7, 2013

Separately, the memory usage of MultiIndex is totally unacceptable in reindexing operations (the creation of an array of Python tuples --> hash table is super inefficient). More work to do there too. cc @njsmith
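To put a rough number on that (my arithmetic, not MultiIndex's actual layout; small-integer caching makes the boxed-value cost approximate):

import sys

n = 10000000
per_tuple = sys.getsizeof((1, 2)) + 2 * sys.getsizeof(1)   # tuple object plus two boxed values
tuples_mb = n * per_tuple / 1e6                            # materializing n label tuples
codes_mb = 2 * n * 8 / 1e6                                 # two int64 label/code arrays instead
print("tuples ~%.0f MB vs integer codes ~%.0f MB" % (tuples_mb, codes_mb))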


jreback commented Aug 7, 2013

So the garbage collector DOES get the memory, but if we add a __del__ it is removed 'faster'.

Current master

In [1]: import gc

In [2]: def f():
   ...:     s = Series(randn(100000))
   ...:     s[100]
   ...:     

In [3]: %memit -r 10 f()
maximum of 10: 76.613281 MB per loop

In [4]: %memit -r 10 f()
maximum of 10: 105.117188 MB per loop

In [5]: %memit -r 10 f()
maximum of 10: 132.832031 MB per loop

In [6]: gc.collect()
Out[6]: 260

In [7]: %memit -r 10 f()
maximum of 10: 76.640625 MB per loop

In [8]: %memit -r 10 f()
maximum of 10: 104.425781 MB per loop

In [9]: %memit -r 10 f()
maximum of 10: 132.210938 MB per loop

Adding to core/series.py

    def __del__(self):
        if self._index is not None:
            self._index._cleanup()
        self._index = None

In [1]: import gc

In [2]: def f():
   ...:     s = Series(randn(100000))
   ...:     s[100]
   ...:     

In [3]: %memit -r 10 f()
maximum of 10: 58.546875 MB per loop

In [4]: %memit -r 10 f()
maximum of 10: 66.179688 MB per loop

In [5]: %memit -r 10 f()
maximum of 10: 74.593750 MB per loop

In [6]: gc.collect()
Out[6]: 260

In [7]: %memit -r 10 f()
maximum of 10: 59.378906 MB per loop

In [8]: %memit -r 10 f()
maximum of 10: 67.007812 MB per loop

In [9]: %memit -r 10 f()
maximum of 10: 74.648438 MB per loop

njsmith commented Aug 7, 2013

If you want memory to be released more promptly, then your best bet is to get rid of the reference cycle, not to add a __del__ method. You clearly have some reference cycle somewhere, since that's the only situation in which memory is *not* freed immediately, and also the only situation in which the gc gets involved at all. But if you have a __del__ method on an object in a cycle, then it can't be freed at all (!), so __del__ is a risky tool to be using to manage cycles.
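A tiny illustration of that hazard -- the class below is illustrative, not pandas code, and the behavior described is CPython 2, which was current at the time; since Python 3.4 / PEP 442 such cycles can in fact be collected:

import gc

class Node(object):
    def __del__(self):
        pass                     # any finalizer is enough on Python 2

a = Node()
b = Node()
a.other, b.other = b, a          # a <-> b form a reference cycle
del a, b

gc.collect()
print(gc.garbage)                # Python 2: the two Node objects, never freed
                                 # Python 3.4+: empty list, the cycle was collected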

wesm commented Sep 29, 2016

Tabled for pandas 2.0

wesm closed this as completed Sep 29, 2016
jorisvandenbossche modified the milestones: Someday, No action Sep 29, 2016