Managing pandas's hash table footprint #4491

Closed
wesm opened this issue Aug 7, 2013 · 4 comments

Labels: Enhancement, Performance

Comments

wesm commented Aug 7, 2013

Many operations in pandas trigger the population of a hash table from the underlying index values. This can be very costly in memory usage, especially for very long Series / DataFrames.

Example:

import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10000000))
s[100]

Note that the first time you index into s like this, a hash table in s.index._engine is populated, so something larger than 80MB of space probably gets used up by this.
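For illustration, a rough way to watch resident memory grow around that first lookup (a sketch only: it assumes psutil is installed, and the actual growth depends on the pandas version and index type):

import os
import numpy as np
import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

def rss_mb():
    return proc.memory_info().rss / 1e6   # resident set size, in MB

s = pd.Series(np.random.randn(10000000))
before = rss_mb()
s[100]            # first label lookup populates the index engine's hash table
after = rss_mb()
print("RSS grew by roughly %.0f MB" % (after - before))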

We can have a whole discussion about the hash table infra and how it could be better than it is now (I may spend some time on this myself soon). For the time being, one solution would be to have a weakref dictionary someplace (or something of that nature) that enables us to globally destroy all hash tables.
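For concreteness, a minimal sketch of what that could look like -- the class and function names here are hypothetical stand-ins, not existing pandas internals:

import weakref

_engine_registry = weakref.WeakSet()   # tracks live engines without keeping them alive

class HashTableEngine(object):
    """Stand-in for an index engine that lazily builds a hash table."""
    def __init__(self, values):
        self.values = values
        self.mapping = None            # the (potentially large) hash table
        _engine_registry.add(self)

    def ensure_mapping(self):
        if self.mapping is None:
            self.mapping = dict((v, i) for i, v in enumerate(self.values))
        return self.mapping

    def clear_mapping(self):
        self.mapping = None            # drop the hash table, keep the engine usable

def destroy_all_hash_tables():
    """Globally drop every live engine's populated hash table."""
    for engine in list(_engine_registry):
        engine.clear_mapping()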

Separately, the hash table scheme can be modified to have a much smaller memory footprint than it does now -- keys can be integers stored as uint32_t, resulting in roughly 4 bytes per index value. Right now there are two arrays: one for hash keys, another for hash values.
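Back-of-the-envelope arithmetic on that proposal for the 10 million entry example above (my numbers; khash load factor and bookkeeping overhead are ignored):

n = 10000000
current = n * (8 + 8)   # two arrays today: 8-byte hash keys + 8-byte hash values
proposed = n * 4        # keys stored as uint32_t offsets, ~4 bytes per index value
print(current / 1e6, "MB now vs", proposed / 1e6, "MB proposed")   # 160.0 vs 40.0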

Motivated by http://stackoverflow.com/questions/18070520/pandas-memory-usage-when-reindexing


wesm commented Aug 7, 2013

Separately, the memory usage of MultiIndex is totally unacceptable in reindexing operations (the creation of an array of Python tuples --> hash table is super inefficient). More work to do there too. cc @njsmith
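To put a rough number on that (my arithmetic, not MultiIndex's actual layout; small-integer caching makes the boxed-value cost approximate):

import sys

n = 10000000
per_tuple = sys.getsizeof((1, 2)) + 2 * sys.getsizeof(1)   # tuple object plus two boxed values
tuples_mb = n * per_tuple / 1e6                            # materializing n label tuples
codes_mb = 2 * n * 8 / 1e6                                 # two int64 label/code arrays instead
print("tuples ~%.0f MB vs integer codes ~%.0f MB" % (tuples_mb, codes_mb))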


jreback commented Aug 7, 2013

So the garbage collector DOES get the memory, but if we add a __del__ it is removed 'faster'.

Current master

In [1]: import gc

In [2]: def f():
   ...:     s = Series(randn(100000))
   ...:     s[100]
   ...:     

In [3]: %memit -r 10 f()
maximum of 10: 76.613281 MB per loop

In [4]: %memit -r 10 f()
maximum of 10: 105.117188 MB per loop

In [5]: %memit -r 10 f()
maximum of 10: 132.832031 MB per loop

In [6]: gc.collect()
Out[6]: 260

In [7]: %memit -r 10 f()
maximum of 10: 76.640625 MB per loop

In [8]: %memit -r 10 f()
maximum of 10: 104.425781 MB per loop

In [9]: %memit -r 10 f()
maximum of 10: 132.210938 MB per loop

Adding to core/series.py

    def __del__(self):
        if self._index is not None:
            self._index._cleanup()
        self._index = None

In [1]: import gc

In [2]: def f():
   ...:     s = Series(randn(100000))
   ...:     s[100]
   ...:     

In [3]: %memit -r 10 f()
maximum of 10: 58.546875 MB per loop

In [4]: %memit -r 10 f()
maximum of 10: 66.179688 MB per loop

In [5]: %memit -r 10 f()
maximum of 10: 74.593750 MB per loop

In [6]: gc.collect()
Out[6]: 260

In [7]: %memit -r 10 f()
maximum of 10: 59.378906 MB per loop

In [8]: %memit -r 10 f()
maximum of 10: 67.007812 MB per loop

In [9]: %memit -r 10 f()
maximum of 10: 74.648438 MB per loop

njsmith commented Aug 7, 2013

If you want memory to be released more promptly, then your best bet is to get rid of the reference cycle, not to add a __del__ method. You clearly have some reference cycle somewhere, since that's the only situation in which memory is *not* freed immediately, and also the only situation in which the gc gets involved at all. But if you have a __del__ method on an object in a cycle, then it can't be freed at all (!), so __del__ is a risky tool to be using to manage cycles.
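A tiny illustration of that hazard -- the class below is illustrative, not pandas code, and the behavior described is CPython 2, which was current at the time; since Python 3.4 / PEP 442 such cycles can in fact be collected:

import gc

class Node(object):
    def __del__(self):
        pass                     # any finalizer is enough on Python 2

a = Node()
b = Node()
a.other, b.other = b, a          # a <-> b form a reference cycle
del a, b

gc.collect()
print(gc.garbage)                # Python 2: the two Node objects, never freed
                                 # Python 3.4+: empty list, the cycle was collected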

wesm commented Sep 29, 2016

Tabled for pandas 2.0

wesm closed this as completed Sep 29, 2016
jorisvandenbossche modified the milestones: Someday, No action Sep 29, 2016