Skip to content

gh-103200: Fix performance issues with zipimport.invalidate_caches() #103208

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

desmondcheongzx
Copy link
Contributor

@desmondcheongzx desmondcheongzx commented Apr 3, 2023

This PR fixes the over-eagerness of the original zipimport.invalidate_caches() implementation.

Currently in zipimport.invalidate_caches(), the cache of zip files is repopulated at the point of invalidation. This causes cache invalidation to be slow, and violates the semantics of cache invalidation which should simply clear the cache. Cache repopulation should occur on the next access of files.

There are three relevant events to consider:

  1. The cache is accessed while valid
  2. invalidate_caches() is called
  3. The cache is accessed after being invalidated

Events (1) and (2) should be fast, while event (3) can be slow since we're repopulating a cache. In the original implementation, (1) and (3) are fast, but (2) is slow.

This PR shifts the cost of reading the directory out of cache invalidation and back to cache access, while avoiding any behaviour change introduced in Python 3.10+ and keeping the common path of reading the cache performant.

Ideally, this fix should be backported to Python 3.10+.

@ghost
Copy link

ghost commented Apr 3, 2023

All commit authors signed the Contributor License Agreement.
CLA signed

@desmondcheongzx desmondcheongzx force-pushed the fix-zipimport-invalidate-caches branch from 2618585 to e01f4f6 Compare April 3, 2023 08:03
@desmondcheongzx desmondcheongzx force-pushed the fix-zipimport-invalidate-caches branch 2 times, most recently from a9d791a to 3ed86c0 Compare April 3, 2023 08:27
@desmondcheongzx desmondcheongzx force-pushed the fix-zipimport-invalidate-caches branch from 3ed86c0 to e2bd85a Compare April 3, 2023 08:30
@arhadthedev arhadthedev added performance Performance or resource usage stdlib Python modules in the Lib dir labels Apr 3, 2023
Copy link
Member

@brettcannon brettcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned there's a disconnect between the cache being marked as dirty in an instance of zipimporter and _zip_directory_cache.

If you were to create two instances of zipimporter, call invalidate_caches() on one instance but never use it to cause _get_files() to be called, and then continue to using the second instance, the supposedly poisoned cache will continue to be used by the second instance.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@desmondcheongzx desmondcheongzx force-pushed the fix-zipimport-invalidate-caches branch 2 times, most recently from 4ab013f to 18a333c Compare July 4, 2023 09:32
@desmondcheongzx desmondcheongzx force-pushed the fix-zipimport-invalidate-caches branch from 18a333c to eaaa0dc Compare July 4, 2023 09:35
@desmondcheongzx
Copy link
Contributor Author

I have made the requested changes; please review again

@bedevere-bot
Copy link

Thanks for making the requested changes!

@brettcannon: please review the changes made to this pull request.

@bedevere-bot bedevere-bot requested a review from brettcannon July 4, 2023 21:11
@brettcannon brettcannon enabled auto-merge (squash) July 7, 2023 21:31
@brettcannon
Copy link
Member

LGTM! I tweaked some comments to be have proper punctuation and updated the branch. I have turned on auto-merge so this should go in once CI passes!

@brettcannon brettcannon self-assigned this Jul 7, 2023
@brettcannon brettcannon merged commit 1fb9bd2 into python:main Jul 7, 2023
@desmondcheongzx
Copy link
Contributor Author

Thanks @brettcannon, I appreciate it!

@AdamWill
Copy link
Contributor

AdamWill commented Jun 20, 2024

I think this may have caused a failure in the dask/distributed test suite: dask/distributed#8708 . I'm not sure if that means it's a bug in cpython, or just the distributed test suite doing something weird. The test suite does this:

        for value in [123, 456]:
            with tmp_text(
                "myfile.py", f"def f():\n    return {value}"
            ) as fn_my_file:
                with zipfile.ZipFile("myfile.zip", "w") as z:
                    z.write(fn_my_file, arcname=os.path.basename(fn_my_file))
                await c.upload_file("myfile.zip")

                x = c.submit(g, pure=False)
                result = await x
                assert result == value

the c.upload_file call triggers this bit of distributed.utils.import_file():

if ext in (".egg", ".zip", ".pyz"):
    if path not in sys.path:
        sys.path.insert(0, path)
    names = (mod_info.name for mod_info in pkgutil.iter_modules([path]))
    names_to_import.extend(names)

and that pkgutil.iter_modules([path]) blows up in iter_zipimport_modules, which accesses zipimport._zip_directory_cache directly:

        def iter_zipimport_modules(importer, prefix=''):
>           dirlist = sorted(zipimport._zip_directory_cache[importer.archive])
E           KeyError: '/tmp/dask-scratch-space/scheduler-belyj1dp/myfile.zip'

Note the somewhat odd thing the test does is to upload two different files with the same name in succession. The failure only happens in that case. If we make the test upload only once, or use a different filename each time, this doesn't happen.

@AdamWill
Copy link
Contributor

If I change the dirlist = line in pkgutil.py to be this, the test passes:

dirlist = sorted(zipimporter(importer.archive)._get_files())

@AdamWill
Copy link
Contributor

oh, forgot to mention, this test is all async. I tried building a minimal test reproducer that's synchronous, but it doesn't reproduce. I'll try doing an async reproducer tomorrow...

@desmondcheongzx
Copy link
Contributor Author

@AdamWill while using dirlist = sorted(zipimporter(importer.archive)._get_files()) works because the _get_files() method handles the keyError, I'm not sure how I feel about a private API being used. We should either make it public, or fix pkgutil. I'm leaning towards the latter because public APIs should not break user code.

Could you file an issue against this and tag me?

beachmachine added a commit to beachmachine/cpython that referenced this pull request Jul 13, 2024
…ated cache (python#121342)

It is no longer safe to directly access `zipimport._zip_directory_cache` since python#103208. It is
not guaranteed that the cache is acutally filled. This changed fixes this by using the
internal method `_get_files` instead, which may not be the best solution, but fixes the issue.
beachmachine added a commit to beachmachine/cpython that referenced this pull request Jul 13, 2024
…ated cache (python#121342)

It is no longer safe to directly access `zipimport._zip_directory_cache` since python#103208. It is
not guaranteed that the cache is acutally filled. This changed fixes this by using the
internal method `_get_files` instead, which may not be the best solution, but fixes the issue.
beachmachine added a commit to beachmachine/cpython that referenced this pull request Jul 13, 2024
…ated cache (python#121342)

It is no longer safe to directly access `zipimport._zip_directory_cache` since python#103208. It is
not guaranteed that the cache is acutally filled. This changed fixes this by using the
internal method `_get_files` instead, which may not be the best solution, but fixes the issue.
beachmachine added a commit to beachmachine/cpython that referenced this pull request Jul 13, 2024
…ated cache (python#121342)

It is no longer safe to directly access `zipimport._zip_directory_cache` since python#103208. It is
not guaranteed that the cache is acutally filled. This changed fixes this by using the
internal method `_get_files` instead, which may not be the best solution, but fixes the issue.
beachmachine added a commit to beachmachine/cpython that referenced this pull request Jul 14, 2024
…ated cache (python#121342)

It is no longer safe to directly access `zipimport._zip_directory_cache` since python#103208. It is
not guaranteed that the cache is acutally filled. This changed fixes this by using the
internal method `_get_files` instead, which may not be the best solution, but fixes the issue.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Performance or resource usage stdlib Python modules in the Lib dir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants