Memoize save_function #45

@mrocklin

Description

I often need to serialize many small objects containing many Python functions.

In [1]: def inc(x):
   ...:     return x + 1
   ...:

In [2]: d = {i: (inc, i) for i in range(10000)}

Sometimes I do this all at once; this works great.

In [3]: from cloudpickle import dumps, loads

In [4]: %time len(dumps(d))
CPU times: user 118 ms, sys: 0 ns, total: 118 ms
Wall time: 117 ms

But sometimes I do this in several small batches, which is much slower.

In [5]: %time len([dumps(item) for item in d.items()])
CPU times: user 2.7 s, sys: 3.93 ms, total: 2.7 s
Wall time: 2.71 s

A quick profile shows that the majority of the time is spent in save_function:

In [7]: %prun -s cumtime len([dumps(item) for item in d.items()])
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.782    4.782 {built-in method exec}
        1    0.001    0.001    4.782    4.782 <string>:1(<module>)
        1    0.038    0.038    4.782    4.782 <string>:1(<listcomp>)
    10000    0.025    0.000    4.744    0.000 cloudpickle.py:598(dumps)
    10000    0.011    0.000    4.658    0.000 cloudpickle.py:104(dump)
    10000    0.030    0.000    4.646    0.000 pickle.py:401(dump)
450000/10000    0.894    0.000    4.597    0.000 pickle.py:460(save)
120000/10000    0.287    0.000    4.568    0.000 pickle.py:716(save_tuple)
50000/10000    0.115    0.000    4.296    0.000 cloudpickle.py:162(save_function)
    10000    0.053    0.000    4.254    0.000 cloudpickle.py:214(save_function_tuple)
    10000    0.020    0.000    2.834    0.000 cloudpickle.py:142(save_codeobject)
40000/10000    0.120    0.000    2.814    0.000 cloudpickle.py:470(save_reduce)
50000/40000    0.117    0.000    1.285    0.000 cloudpickle.py:318(save_global)
    20000    0.039    0.000    1.058    0.000 pickle.py:680(save_bytes)
   290000    0.519    0.000    1.044    0.000 pickle.py:416(memoize)
    40000    0.257    0.000    0.716    0.000 pickle.py:898(save_global)
    70000    0.147    0.000    0.479    0.000 pickle.py:698(save_str)
   800000    0.270    0.000    0.392    0.000 pickle.py:212(write)

And so I'm tempted to memoize save_function between dumps calls, presumably with some sort of LRU mechanism keyed by object identity. This is unsafe if functions are mutated in any way; I've never run into such a situation, but I'm unsure whether it happens elsewhere.
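A minimal sketch of what such a memoization layer could look like, assuming we key on id() and bound the cache with LRU eviction. The helper name make_cached_dumps is illustrative, not part of cloudpickle's API, and the stdlib pickle.dumps is used here only as a stand-in for cloudpickle.dumps:

```python
from collections import OrderedDict
import pickle


def make_cached_dumps(dumps=pickle.dumps, maxsize=1024):
    """Wrap a dumps callable with an LRU cache keyed by object identity.

    Unsafe if a cached object is mutated in place between calls, since
    id() would still hit the stale cached bytes.
    """
    cache = OrderedDict()

    def cached_dumps(obj):
        key = id(obj)
        if key in cache:
            cache.move_to_end(key)     # mark as most recently used
            return cache[key]
        payload = dumps(obj)
        cache[key] = payload
        if len(cache) > maxsize:
            cache.popitem(last=False)  # evict the least recently used entry
        return payload

    return cached_dumps
```

In practice one would pass cloudpickle.dumps as the dumps argument and only route function objects through the cache.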

On looking into cloudpickle more deeply, it appears that Pickler already has a caching mechanism within it. Does anyone have experience with these memo objects? I would need to clear the non-function entries out of the cache between calls.
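To illustrate that memo: dumping the same object twice through one Pickler emits only a short back-reference the second time, which is also why the memo would need clearing between independent dumps calls (a standalone Unpickler could not resolve a back-reference into a stream it never saw). A sketch using the stdlib Pickler, which cloudpickle's pickler subclasses:

```python
import io
import pickle

buf = io.BytesIO()
p = pickle.Pickler(buf)
obj = ("hello", list(range(100)))

p.dump(obj)
first = buf.tell()                 # full encoding of obj

p.dump(obj)                        # memo hit: mostly a tiny back-reference
second = buf.tell() - first

p.clear_memo()                     # forget cached entries between dumps
p.dump(obj)                        # full encoding again
third = buf.tell() - first - second
```

Here second is far smaller than first, and after clear_memo() the third dump is full-size again.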

I'm happy to do the work here if we are able to agree on a good solution.
