Managing pandas's hash table footprint #4491

Closed
@wesm

Description

Many operations in pandas trigger the population of a hash table from the underlying index values. This can be very costly in memory usage, especially for very long Series / DataFrames.

Example:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10000000))
s[100]

Note that the first time s is indexed this way, a hash table is populated in s.index._engine. For 10 million entries, that probably uses up more than 80MB of memory.
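A small sketch of this lazy population. The shuffled index is an assumption on my part: recent pandas versions answer lookups on a monotonic unique index by binary search and may never build the hash table, whereas a non-monotonic index forces it (the 2013 internals this issue describes built it more eagerly). `_engine` is a private attribute and subject to change.

```python
import numpy as np
import pandas as pd

# Shuffled labels force the hash-based lookup path (assumption about
# current internals; a monotonic index may use binary search instead).
rng = np.random.default_rng(0)
n = 1_000_000
idx = rng.permutation(n)
s = pd.Series(rng.standard_normal(n), index=idx)

engine = s.index._engine   # private API, subject to change
x = s[100]                 # first label lookup populates the hash table
# The populated table now sits in memory alongside the index values
# themselves, which is the footprint this issue is about.
```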

We can have a whole discussion about the hash table infra and how it could be better than it is now (I may spend some time on this myself soon). For the time being, one solution would be to have a weakref dictionary someplace (or something of that nature) that enables us to globally destroy all hash tables.
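A minimal sketch of that idea, with a stand-in engine class (the names `HashEngine`, `clear_all_hash_tables`, and the registry are hypothetical, not pandas API). A WeakSet holds no strong references, so engines whose owners are garbage-collected drop out of the registry automatically:

```python
import weakref

# Hypothetical global registry of all live engines.
_engine_registry = weakref.WeakSet()

class HashEngine:
    """Stand-in for pandas's index engine (illustrative only)."""
    def __init__(self, values):
        self.values = values
        self.mapping = None            # hash table, built lazily
        _engine_registry.add(self)

    def get_loc(self, key):
        if self.mapping is None:       # populate on first lookup
            self.mapping = {v: i for i, v in enumerate(self.values)}
        return self.mapping[key]

    def clear(self):
        self.mapping = None            # free the hash table's memory

def clear_all_hash_tables():
    # Destroys every populated table globally; each engine simply
    # rebuilds its table on the next lookup.
    for engine in list(_engine_registry):
        engine.clear()
```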

Separately, the hash table scheme can be modified to have a much smaller memory footprint than it does now -- keys can be integers stored as uint32_t, resulting in roughly 4 bytes per index value. Right now there are two arrays: one for hash keys, another for hash values.
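Back-of-the-envelope arithmetic for 10 million entries, under one reading of the proposal: the table stores only uint32 positions and compares candidate keys against the existing index array, instead of keeping parallel int64 key and value arrays. Load factor and bucket overhead are ignored.

```python
n = 10_000_000

# Current scheme (assumed): two parallel int64 arrays.
current_bytes = n * (8 + 8)

# Proposed scheme (one reading): a single uint32 position array;
# key comparison goes through the index values already in memory.
proposed_bytes = n * 4

print(current_bytes / 2**20)    # roughly 152.6 MiB
print(proposed_bytes / 2**20)   # roughly 38.1 MiB
```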

Motivated by http://stackoverflow.com/questions/18070520/pandas-memory-usage-when-reindexing
