`zipimport.invalidate_caches()` implementation causes performance regression for `importlib.invalidate_caches()` #103200
Comments
Since |
@brettcannon In my opinion, it might still be considered a regression for The practical difference is whether we backport this change to 3.11. It feels to me that there's value in doing so. |
GH-103208) Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com> Co-authored-by: Brett Cannon <[email protected]>
https://github.com/python/cpython/issues/103200[: Fix performance issues with](1fb9bd2) zipimport.invalidate_caches() ( … |
@eliyahweinberg correct, it wasn't backported at all. |
In Python 3.10+, an implementation of `zipimport.invalidate_caches()` was introduced.

An Apache Spark developer recently identified this implementation of `zipimport.invalidate_caches()` as the source of performance regressions for `importlib.invalidate_caches()`. They observed that importing only two zipped packages (py4j and pyspark) slows down `importlib.invalidate_caches()` by up to 3500%. See the new discussion thread on the original PR where `zipimport.invalidate_caches()` was introduced for more context.

The reason for this regression is an incorrect design for the API.
Currently in `zipimport.invalidate_caches()`, the cache of zip files is repopulated at the point of invalidation. This violates the semantics of cache invalidation, which should simply clear the cache. Cache repopulation should occur on the next access of files.

There are three relevant events to consider:
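1. The cache is read while it is valid.
2. `invalidate_caches()` is called.
3. The cache is read for the first time after being invalidated.

Events (1) and (2) should be fast, while event (3) can be slow since we're repopulating a cache. In the original PR, we made (1) and (3) fast, but (2) slow. To fix this we can do the following: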
1. Add a flag `cache_is_valid` that is set to false when `invalidate_caches()` is called.
2. In `_get_files()`, if `cache_is_valid` is true, use the cache. If `cache_is_valid` is false, call `_read_directory()`.

This approach avoids any behaviour change introduced in Python 3.10+ and keeps the common path of reading the cache performant, while also shifting the cost of reading the directory out of cache invalidation.
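The proposed fix can be illustrated with a small standalone sketch. This is not the CPython `zipimport` code: `LazyZipCache`, `demo()`, and the module-level `read_directory()` helper are hypothetical names for illustration, and `read_directory()` only stands in for `_read_directory()` by listing the archive with the standard `zipfile` module. The call counter is there to show that invalidation itself never touches the archive.

```python
import os
import tempfile
import zipfile


def read_directory(archive):
    """Stand-in for _read_directory(): the expensive scan of the archive."""
    read_directory.calls += 1
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: info.file_size for info in zf.infolist()}


read_directory.calls = 0  # counts how often the archive is actually scanned


class LazyZipCache:
    """Hypothetical model of an importer whose cache is repopulated lazily."""

    def __init__(self, archive):
        self.archive = archive
        self._files = read_directory(archive)
        self.cache_is_valid = True

    def invalidate_caches(self):
        # Cheap: only mark the cache stale, do not re-read the archive here.
        self.cache_is_valid = False

    def _get_files(self):
        # The slow path runs only on the first access after an invalidation.
        if not self.cache_is_valid:
            self._files = read_directory(self.archive)
            self.cache_is_valid = True
        return self._files


def demo():
    read_directory.calls = 0
    with tempfile.TemporaryDirectory() as tmp:
        archive = os.path.join(tmp, "pkg.zip")
        with zipfile.ZipFile(archive, "w") as zf:
            zf.writestr("mod.py", "x = 1\n")

        cache = LazyZipCache(archive)      # one scan at construction
        for _ in range(1000):
            cache.invalidate_caches()      # no scans, however often it is called
        calls_after_invalidation = read_directory.calls

        files = cache._get_files()         # one scan: cache was stale
        cache._get_files()                 # no scan: cache is valid again
        return calls_after_invalidation, read_directory.calls, sorted(files)


if __name__ == "__main__":
    print(demo())
```

In this sketch a thousand invalidations cost a thousand flag writes, and the archive is re-read once, on the next access.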
We can go further and consider the fact that we rarely expect zip archives to change. Given this, we can consider adding a new flag to give users the option of disabling implicit invalidation of zipimported libraries when `importlib.invalidate_caches()` is called.

cc @brettcannon @HyukjinKwon
Linked PRs
`zipimport.invalidate_caches()` #103208