I often need to serialize many small objects, each containing Python functions.
In [1]: def inc(x):
   ...:     return x + 1
   ...:

In [2]: d = {i: (inc, i) for i in range(10000)}

Sometimes I do this all at once; this works great.
In [3]: from cloudpickle import dumps, loads
In [4]: %time len(dumps(d))
CPU times: user 118 ms, sys: 0 ns, total: 118 ms
Wall time: 117 ms

But sometimes I do this in several small batches, which is much slower.
In [5]: %time len([dumps(item) for item in d.items()])
CPU times: user 2.7 s, sys: 3.93 ms, total: 2.7 s
Wall time: 2.71 s

A quick profile shows that the majority of the time is spent in save_function:
In [7]: %prun -s cumtime len([dumps(item) for item in d.items()])
      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.000    0.000    4.782    4.782 {built-in method exec}
           1    0.001    0.001    4.782    4.782 <string>:1(<module>)
           1    0.038    0.038    4.782    4.782 <string>:1(<listcomp>)
       10000    0.025    0.000    4.744    0.000 cloudpickle.py:598(dumps)
       10000    0.011    0.000    4.658    0.000 cloudpickle.py:104(dump)
       10000    0.030    0.000    4.646    0.000 pickle.py:401(dump)
450000/10000    0.894    0.000    4.597    0.000 pickle.py:460(save)
120000/10000    0.287    0.000    4.568    0.000 pickle.py:716(save_tuple)
 50000/10000    0.115    0.000    4.296    0.000 cloudpickle.py:162(save_function)
       10000    0.053    0.000    4.254    0.000 cloudpickle.py:214(save_function_tuple)
       10000    0.020    0.000    2.834    0.000 cloudpickle.py:142(save_codeobject)
 40000/10000    0.120    0.000    2.814    0.000 cloudpickle.py:470(save_reduce)
 50000/40000    0.117    0.000    1.285    0.000 cloudpickle.py:318(save_global)
       20000    0.039    0.000    1.058    0.000 pickle.py:680(save_bytes)
      290000    0.519    0.000    1.044    0.000 pickle.py:416(memoize)
       40000    0.257    0.000    0.716    0.000 pickle.py:898(save_global)
       70000    0.147    0.000    0.479    0.000 pickle.py:698(save_str)
      800000    0.270    0.000    0.392    0.000 pickle.py:212(write)
So I'm tempted to memoize save_function between dumps calls, presumably with some sort of LRU mechanism keyed by object identity. This is unsafe if a function is mutated in any way after it has been cached, because the stale serialization would be reused. I've never run into such a situation, but I'm unsure whether it happens elsewhere.
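A minimal sketch of the LRU idea above, under stated assumptions: it caches the complete serialized bytes of an object at the level of a `dumps` call rather than hooking into save_function inside the pickler, and the `IdentityLRU` class and its `dumps` parameter are hypothetical names, not part of cloudpickle. It defaults to the standard library's `pickle.dumps` so that it runs without extra dependencies; passing `cloudpickle.dumps` would be the intended use here.

```python
import pickle
from collections import OrderedDict


class IdentityLRU:
    """Hypothetical LRU cache of serialized bytes, keyed by object identity."""

    def __init__(self, maxsize=128, dumps=pickle.dumps):
        self.maxsize = maxsize
        self._dumps = dumps          # e.g. cloudpickle.dumps (assumption)
        self._cache = OrderedDict()  # id(obj) -> (obj, serialized bytes)

    def dumps(self, obj):
        key = id(obj)
        entry = self._cache.get(key)
        # Check `is obj` as well: the strong reference stored in the entry
        # keeps the id from being reused while the entry is alive, and the
        # identity check guards against any stale key.
        if entry is not None and entry[0] is obj:
            self._cache.move_to_end(key)  # mark as most recently used
            return entry[1]
        data = self._dumps(obj)
        self._cache[key] = (obj, data)
        if len(self._cache) > self.maxsize:
            self._cache.popitem(last=False)  # evict least recently used
        return data


def inc(x):
    return x + 1


cache = IdentityLRU(maxsize=2)
first = cache.dumps(inc)
second = cache.dumps(inc)  # served from the cache, not re-serialized
```

The cache never notices if `inc` is mutated after the first call (for example by assigning a new attribute or replacing `__defaults__`), which is exactly the safety concern raised above. Holding a strong reference to each cached object also extends its lifetime until it is evicted.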
On looking into cloudpickle more deeply, it appears that Pickler has a caching mechanism within it. Does anyone have experience with these memo objects? I would need to clear out non-function elements from the cache between calls.
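For reference, a small sketch of how the memo behaves when a single standard-library `pickle.Pickler` is reused across several `dump` calls. It uses plain `pickle` rather than cloudpickle so that it runs anywhere; the `payload` value and the size variables are illustrative only. The memo persists between `dump` calls until `clear_memo()` is called, so an object already seen is written as a short back-reference.

```python
import io
import pickle

payload = ("x" * 1000, 1)  # illustrative object, large enough to see the effect

buf = io.BytesIO()
pickler = pickle.Pickler(buf)

pickler.dump(payload)
first_size = buf.tell()

# Same pickler, same object: the memo still holds it, so only a
# reference to the earlier entry is written.
pickler.dump(payload)
second_size = buf.tell() - first_size

# After clearing the memo the object is written out in full again.
pickler.clear_memo()
pickler.dump(payload)
third_size = buf.tell() - first_size - second_size

print(first_size, second_size, third_size)
```

Two caveats apply. A pickle that relies on memo entries from an earlier `dump` is not self-contained, so whoever reads it needs the earlier data from the same stream as well. And `clear_memo()` drops everything; keeping only the function entries, as described above, would need extra logic on top of the memo.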
I'm happy to do the work here if we are able to agree on a good solution.