Description
Many operations in pandas trigger the population of a hash table from the underlying index values. This can be very costly in memory, especially for very long Series / DataFrames.
Example:
```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(10000000))
s[100]
```
Note that the first time you index into `s`, a hash table at `s.index._engine` is populated from the 10 million index values. Since the 8-byte keys alone come to 80MB, something larger than 80MB of space probably gets used up by this.
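A rough way to see the jump is to watch the process RSS around the first lookup. This uses `psutil` (not a pandas dependency) rather than any pandas API, and the exact delta will vary by platform and version, so treat it as illustrative:

```python
import os

import numpy as np
import pandas as pd
import psutil

proc = psutil.Process(os.getpid())

s = pd.Series(np.random.randn(10000000))
before = proc.memory_info().rss
s[100]  # first lookup populates the index's hash table
after = proc.memory_info().rss
print((after - before) / 1e6, "MB")  # roughly the table's footprint
```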
We can have a whole discussion about the hash table infrastructure and how it could be better than it is now (I may spend some time on this myself soon). For the time being, one solution would be to keep a weakref dictionary somewhere (or something of that nature) that enables us to globally destroy all hash tables; a sketch of that idea follows.
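A minimal sketch of what such a registry might look like, assuming engine objects are weak-referenceable; `register_engine`, `clear_all_engines`, and the `clear_mapping` method are hypothetical names for illustration, not pandas API:

```python
import weakref

# Hypothetical global registry of index engines. Weak references mean
# the registry never keeps an engine (or its index) alive on its own.
_engine_registry = weakref.WeakSet()

def register_engine(engine):
    # An engine would call this when it first populates its hash table.
    _engine_registry.add(engine)

def clear_all_engines():
    # Globally destroy every live hash table to reclaim memory.
    for engine in list(_engine_registry):
        engine.clear_mapping()  # assumed method that frees the table
```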
Separately, the hash table scheme could be modified to have a much smaller memory footprint than it does now: keys can be integers stored as `uint32_t`, resulting in roughly 4 bytes per index value. Right now there are two arrays, one for hash keys and another for hash values.
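Here is a minimal sketch of that layout, assuming open addressing with linear probing; `CompactTable`, the `EMPTY` sentinel, and the load-factor choice are all illustrative, not pandas internals. The table stores only `uint32` positions, and probes compare against the original values array, so the 8-byte keys are never duplicated:

```python
import numpy as np

EMPTY = np.uint32(0xFFFFFFFF)  # sentinel marking an unused slot

class CompactTable:
    def __init__(self, values):
        self.values = np.asarray(values)
        n = len(self.values)
        size = 1
        while size < 2 * n:  # keep the load factor below 0.5
            size *= 2
        self.slots = np.full(size, EMPTY, dtype=np.uint32)
        self.mask = size - 1
        for pos, v in enumerate(self.values):
            i = hash(v) & self.mask
            while self.slots[i] != EMPTY:  # linear probing on collision
                i = (i + 1) & self.mask
            self.slots[i] = pos  # store only the 4-byte position

    def lookup(self, key):
        i = hash(key) & self.mask
        while self.slots[i] != EMPTY:
            pos = int(self.slots[i])
            if self.values[pos] == key:  # compare against original array
                return pos
            i = (i + 1) & self.mask
        raise KeyError(key)

table = CompactTable(np.arange(10))
print(table.lookup(7))  # -> 7
```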
Motivated by http://stackoverflow.com/questions/18070520/pandas-memory-usage-when-reindexing