gh-103200: Fix performance issues with zipimport.invalidate_caches()
#103208
Conversation
Force-pushed 2618585 to e01f4f6
Force-pushed a9d791a to 3ed86c0
Force-pushed 3ed86c0 to e2bd85a
I'm concerned there's a disconnect between the cache being marked as dirty in an instance of `zipimporter` and `_zip_directory_cache`.

If you were to create two instances of `zipimporter`, call `invalidate_caches()` on one instance but never use it to cause `_get_files()` to be called, and then continue using the second instance, the supposedly poisoned cache will continue to be used by the second instance.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase "I have made the requested changes; please review again".
Force-pushed 4ab013f to 18a333c
Force-pushed 18a333c to eaaa0dc
I have made the requested changes; please review again

Thanks for making the requested changes! @brettcannon: please review the changes made to this pull request.

LGTM! I tweaked some comments to have proper punctuation and updated the branch. I have turned on auto-merge, so this should go in once CI passes!
Thanks @brettcannon, I appreciate it! |
I think this may have caused a failure in the dask/distributed test suite: dask/distributed#8708. I'm not sure if that means it's a bug in CPython, or just the distributed test suite doing something weird.

Note the somewhat odd thing the test does: it uploads two different files with the same name in succession. The failure only happens in that case. If we make the test upload only once, or use a different filename each time, this doesn't happen.
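A hypothetical, simplified sketch of the pattern just described (not the actual dask/distributed test, which is async and goes through its file-upload machinery): write a module, import it, overwrite the same filename with different contents, invalidate the import caches, and re-import.

```python
import importlib
import os
import sys
import tempfile

tmp = tempfile.mkdtemp()
sys.path.insert(0, tmp)

def upload(source):
    # Stand-in for distributed's upload: rewrite the same filename.
    with open(os.path.join(tmp, "uploaded_mod.py"), "w") as f:
        f.write(source)
    # Ask the import system to forget any cached directory listings.
    importlib.invalidate_caches()

upload("VALUE = 'first'\n")
import uploaded_mod
first = uploaded_mod.VALUE

# Second upload: same filename, different contents.
upload("VALUE = 'second version'\n")
uploaded_mod = importlib.reload(uploaded_mod)
second = uploaded_mod.VALUE
```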
Oh, forgot to mention: this test is all async. I tried building a minimal synchronous reproducer, but it doesn't reproduce. I'll try an async reproducer tomorrow...
@AdamWill, could you file an issue against this and tag me?
…ated cache (python#121342) It is no longer safe to directly access `zipimport._zip_directory_cache` since python#103208. It is not guaranteed that the cache is actually filled. This change fixes that by using the internal method `_get_files` instead, which may not be the best solution, but it fixes the issue.
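The workaround that commit describes can be sketched as a small helper (a hedged illustration: `_get_files` is a CPython-internal method, assumed present only in versions that include gh-103208, and may change without notice; the archive path here is made up):

```python
import os
import tempfile
import zipfile
import zipimport

def zip_directory_listing(importer):
    # Prefer the importer's internal accessor, which re-reads the
    # archive if its cache was invalidated, so it never returns a
    # stale or empty view.
    get_files = getattr(importer, "_get_files", None)
    if get_files is not None:
        return get_files()
    # Fallback for older interpreters, where the module-level cache
    # was always kept filled.
    return zipimport._zip_directory_cache[importer.archive]

# Illustration with a throwaway archive:
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "pkg.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("mod.py", "VALUE = 1\n")
files = zip_directory_listing(zipimport.zipimporter(archive))
```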
This PR fixes the over-eagerness of the original `zipimport.invalidate_caches()` implementation.

Currently in `zipimport.invalidate_caches()`, the cache of zip files is repopulated at the point of invalidation. This makes cache invalidation slow, and violates the semantics of cache invalidation, which should simply clear the cache. Cache repopulation should occur on the next access of files.

There are three relevant events to consider:

1. the cache is read (a normal access)
2. `invalidate_caches()` is called
3. the cache is read for the first time after an invalidation

Events (1) and (2) should be fast, while event (3) can be slow since we're repopulating a cache. In the original implementation, (1) and (3) are fast, but (2) is slow.
This PR shifts the cost of reading the directory out of cache invalidation and back to cache access, while avoiding any behaviour change introduced in Python 3.10+ and keeping the common path of reading the cache performant.
Ideally, this fix should be backported to Python 3.10+.
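The cost-shifting described above can be sketched as a small stand-alone cache (a hypothetical illustration of the lazy-invalidation pattern, not CPython's actual zipimport code):

```python
class LazyDirectoryCache:
    """Illustration only: invalidation just drops the data; the
    reload cost is paid on the first access afterwards."""

    def __init__(self, loader):
        self._loader = loader   # callable that (re)reads the directory
        self._files = None      # None means "needs (re)population"

    def invalidate(self):
        # Event (2): fast -- just discard the cached data.
        self._files = None

    def get(self):
        # Event (1) is a cheap check; event (3) pays the reload cost.
        if self._files is None:
            self._files = self._loader()
        return self._files

# Usage: the loader runs only on first access and after invalidation.
reads = []
def read_directory():
    reads.append(1)
    return {"mod.py": "directory entry"}

cache = LazyDirectoryCache(read_directory)
cache.get()
cache.get()          # served from cache; loader not called again
cache.invalidate()   # fast: no re-read happens here
cache.get()          # first access after invalidation re-reads
```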
Linked issue: `zipimport.invalidate_caches()` implementation causes performance regression for `importlib.invalidate_caches()` #103200