Managing pandas's hash table footprint #4491
Comments (excerpts):

- Separately, the memory usage of MultiIndex is totally unacceptable in reindexing operations (the creation of an array of Python tuples --> hash table is super inefficient). More work to do there too. cc @njsmith (see the sketch after this list)
- So the garbage collector DOES get the memory, but if we add a […]
- If you want memory to be released more promptly, then your best bet is to […]
- Tabled for pandas 2.0
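On the MultiIndex comment above, a small illustration of why materializing an array of Python tuples is costly (the index shape here is only an assumption for demonstration):

```python
import sys
import pandas as pd

# A modest 1,000 x 1,000 MultiIndex -- one million labels.
mi = pd.MultiIndex.from_product([range(1000), range(1000)])

# Reindexing-style operations materialize the labels as an object array
# of one million Python tuples; each tuple alone (~56 bytes in CPython)
# dwarfs the two small integer codes it wraps.
tuples = mi.values
print(sys.getsizeof(tuples[0]))  # per-tuple overhead, excluding elements
```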
Issue description:

Many operations in pandas trigger the population of a hash table from the underlying index values. This can be very costly in memory usage, especially for very long Series / DataFrames.
Example:
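A minimal sketch of the kind of setup described (the Series length and the lookup are assumptions, sized so the underlying data is about 80MB):

```python
import numpy as np
import pandas as pd

# ~80MB of float64 data: 10 million values * 8 bytes each.
s = pd.Series(np.random.randn(10_000_000))

# The first label-based lookup lazily builds the index's hash table,
# so memory usage grows well past the 80MB of underlying data and
# stays there until the Series (and its index) are garbage collected.
s[1000]
```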
Note that after indexing into `s`, the first time this happens a hash table `s.index._engine` is populated. So something larger than 80MB of space probably gets used up by this. We can have a whole discussion about the hash table infra and how it could be better than it is now (I may spend some time on this myself soon). For the time being, one solution would be to have a weakref dictionary someplace (or something of that nature) that enables us to globally destroy all hash tables.
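A rough sketch of that idea; the registry and the `register_engine` / `clear_mapping` names are hypothetical, not pandas API:

```python
import weakref

# Hypothetical global registry; a WeakSet holds no strong references,
# so it never prevents an engine from being garbage collected.
_hash_table_registry = weakref.WeakSet()

def register_engine(engine):
    # Hypothetical hook: called when an index lazily builds its engine.
    _hash_table_registry.add(engine)

def destroy_all_hash_tables():
    # Walk every live engine and drop its hash table to reclaim memory;
    # clear_mapping() is an assumed method on the engine object.
    for engine in list(_hash_table_registry):
        engine.clear_mapping()
```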
Separately, the hash table scheme can be modified to have a much smaller memory footprint than it does now -- keys can be integers stored as `uint32_t`, resulting in roughly 4 bytes per index value. Right now there are two arrays: one for hash keys, another for hash values.

Motivated by http://stackoverflow.com/questions/18070520/pandas-memory-usage-when-reindexing
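Back-of-the-envelope arithmetic for the difference described above (the 10-million-entry table size is an assumption, and load factor is ignored):

```python
n = 10_000_000  # assumed index length

# Current scheme: two parallel arrays per table entry --
# an 8-byte hash key plus an 8-byte hash value (the position).
current_bytes = n * (8 + 8)   # ~160 MB

# Sketched scheme: a single uint32_t per entry that points back into
# the index's own values array, so keys aren't stored twice.
proposed_bytes = n * 4        # ~40 MB, roughly 4 bytes per index value

print(current_bytes / 2**20, proposed_bytes / 2**20)
```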