Skip to content

zipimport.invalidate_caches() implementation causes performance regression for importlib.invalidate_caches() #103200

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
desmondcheongzx opened this issue Apr 2, 2023 · 5 comments
Labels
3.12 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@desmondcheongzx
Copy link
Contributor

desmondcheongzx commented Apr 2, 2023

In Python 3.10+, an implementation of zipimport.invalidate_caches() was introduced.

An Apache Spark developer recently identified this implementation of zipimport.invalidate_caches() as the source of performance regressions for importlib.invalidate_caches(). They observed that importing only two zipped packages (py4j, and pyspark) slows down the speed of importlib.invalidate_caches() up to 3500%. See the new discussion thread on the original PR where zipimport.invalidate_caches() was introduced for more context.

The reason for this regression is an incorrect design for the API.

Currently in zipimport.invalidate_caches(), the cache of zip files is repopulated at the point of invalidation. This violates the semantics of cache invalidation which should simply clear the cache. Cache repopulation should occur on the next access of files.

There are three relevant events to consider:

  1. The cache is accessed while valid
  2. invalidate_caches() is called
  3. The cache is accessed after being invalidated

Events (1) and (2) should be fast, while event (3) can be slow since we're repopulating a cache. In the original PR, we made (1) and (3) fast, but (2) slow. To fix this we can do the following:

  • Add a boolean flag cache_is_valid that is set to false when invalidate_caches() is called.
  • In _get_files(), if cache_is_valid is true, use the cache. If cache_is_valid is false, call _read_directory().

This approach avoids any behaviour change introduced in Python 3.10+ and keeps the common path of reading the cache performant, while also shifting the cost of reading the directory out of cache invalidation.

We can go further and consider the fact that we rarely expect zip archives to change. Given this, we can consider adding a new flag to give users the option of disabling implicit invalidation of zipimported libaries when importlib.invalidate_caches() is called.

cc @brettcannon @HyukjinKwon

Linked PRs

@desmondcheongzx desmondcheongzx added the type-bug An unexpected behavior, bug, or error label Apr 2, 2023
@arhadthedev arhadthedev added the stdlib Python modules in the Lib dir label Apr 3, 2023
@arhadthedev
Copy link
Member

@Yhg1s (as a zipimport module expert)

@arhadthedev arhadthedev added performance Performance or resource usage 3.11 only security fixes 3.10 only security fixes 3.12 only security fixes labels Apr 3, 2023
@brettcannon
Copy link
Member

Since zipimport.invalidate_caches() was introduced in Python 3.10, I don't think this is a regression but a potential place to improve performance.

@brettcannon brettcannon added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error 3.11 only security fixes 3.10 only security fixes labels Apr 4, 2023
@desmondcheongzx
Copy link
Contributor Author

@brettcannon In my opinion, it might still be considered a regression for importlib (not zipimport), since prior to 3.10, importlib.invalidate_caches would be a no-op on zipimported libraries. Post 3.10, importlib.invalidate_caches seems to slow down significantly more than with a proper implementation of zipimport.invalidate_caches.

The practical difference is whether we backport this change to 3.11. It feels to me that there's value in doing so.

brettcannon added a commit that referenced this issue Jul 7, 2023
GH-103208)

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Brett Cannon <[email protected]>
@eliyahweinberg
Copy link

https://github.com/python/cpython/issues/103200[: Fix performance issues with](1fb9bd2) zipimport.invalidate_caches() (
1fb9bd2
#103208)
Is not merged to python 3.11

@brettcannon
Copy link
Member

@eliyahweinberg correct, it wasn't backported at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 only security fixes performance Performance or resource usage stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants