Serializing modules can be slow #84

@mrocklin

Here is an analysis from a colleague:

The speed-up for us seems to be coming from the fact that pickling modules takes a long time:

    In [25]: %timeit cloudpickle.dumps(numpy, -1)
    100 loops, best of 3: 3.03 ms per loop

It looks like _find_module() will use imp.find_module() which traverses sys.path to look for things that look like numpy. In our environment, sys.path tends to be long and our filesystems tend to be slow, hence the 3.03 ms.

    def save_module(self, obj):
        """
        Save a module as an import
        """
        mod_name = obj.__name__
        # If module is successfully found then it is not a dynamically created module
        try:
            _find_module(mod_name)     # EXPENSIVE!!!!!
            is_dynamic = False
        except ImportError:
            is_dynamic = True

        self.modules.add(obj)
        if is_dynamic:
            self.save_reduce(dynamic_subimport, (obj.__name__, vars(obj)), obj=obj)
        else:
            self.save_reduce(subimport, (obj.__name__,), obj=obj)
    dispatch[types.ModuleType] = save_module
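
To make the cost concrete, here is a rough sketch of what a `sys.path` search involves. This is not the actual `imp.find_module()` implementation: `naive_find_module` is a hypothetical, simplified stand-in that only looks for a package directory or a `.py` file. The point it illustrates is that every entry that does not contain the module still costs filesystem calls, so the time grows with the length of `sys.path` and with filesystem latency.

```python
import os
import sys


def naive_find_module(name, search_path=None):
    """Hypothetical, simplified stand-in for a sys.path module search.

    Returns the base path of the first match and raises ImportError
    if no entry on the search path contains the module.
    """
    entries = sys.path if search_path is None else search_path
    for entry in entries:
        # An empty entry on sys.path means the current directory.
        base = os.path.join(entry or os.curdir, name)
        # Each check below hits the filesystem, even when it fails.
        if os.path.isdir(base) or os.path.isfile(base + ".py"):
            return base
    raise ImportError("No module named %r" % name)
```

With a long `sys.path` on a slow network filesystem, most of the time would presumably go into the failed checks for entries that come before the one holding the module.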

So it looks like cloudpickle is trying to allow for "dynamically created modules". If it didn't try to be this flexible, then the entire function could just be

    self.save_reduce(subimport, (obj.__name__,), obj=obj)

So the danger is if people are using "dynamically created modules", which we don't tend to do.

Maybe an easy way out is to check if obj.__file__ exists (the attribute, not the file). If it does, then immediately assume that is_dynamic=False.
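
A minimal sketch of that idea, kept separate from cloudpickle so it can be run on its own. The helper name `is_dynamic_module` is hypothetical, and `importlib.util.find_spec` is used here as a stand-in for the `_find_module()` call in the snippet above; this is an assumption for illustration, not the library's actual code.

```python
import importlib.util


def is_dynamic_module(module):
    """Hypothetical helper: guess whether `module` was created at runtime."""
    # Fast path: a module loaded from a file carries a __file__ attribute,
    # so the path search can be skipped entirely.
    if hasattr(module, "__file__"):
        return False
    # Slow path: ask the import system whether it knows the module.
    # Built-in modules such as sys have no __file__ but are still importable.
    try:
        return importlib.util.find_spec(module.__name__) is None
    except (ImportError, ValueError):
        # A missing parent package or a module without a usable __spec__
        # is treated as dynamically created.
        return True
```

`save_module` could then branch on the result of such a check instead of always running the search. One trade-off to keep in mind: a dynamically created module that sets `__file__` by hand would be treated as importable under this shortcut.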

Fwiw, I think we're pickling numpy because we're pickling functions that refer to numpy. Not positive though.
