-
-
Notifications
You must be signed in to change notification settings - Fork 32k
bpo-14678: Update zipimport to support importlib.invalidate_caches() #24159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-14678: Update zipimport to support importlib.invalidate_caches() #24159
Conversation
Functions in zipimport and zipimporter methods used to access the files and toc entries stored in a ZIP archive via zipimporter's private _files attribute. However, this makes it hard to check if the ZIP archive had been updated, or if the cached information has been invalidated whenever _files is accessed. Hence, this change adds get_files() as an internal accessor for the _files attribute. Whenever get_files() is called, we do a check to make sure that our cached information is up-to-date. This check is done using a new _archive_mtime attribute that stores the last modified timestamp of the ZIP archive whenever zipimport._read_directory is called.
Zipimport's zipimporter are the finders that get cached in sys.path_importer_cache, so this addition enables zipimport to support importlib.invalidate_caches().
I haven't looked at the PR, but why the addition of a new |
Ah I was ambivalent about making |
@desmondcheongzx |
I see, I hadn't looked through that part of importlib. Going through the docs for importlib more closely, it seems that I've renamed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make more sense to consolidate that code that sets self._files
in invalidate_caches()
and call it from __init__()
to avoid having to make so many stat calls.
Lib/zipimport.py
Outdated
def _get_files(self): | ||
"""Return the files within the archive path.""" | ||
try: | ||
mtime = _bootstrap_external._path_stat(self.archive).st_mtime | ||
except OSError: | ||
mtime = -1 | ||
if mtime != self._archive_mtime: | ||
_zip_directory_cache[self.archive] = _read_directory(self.archive) | ||
self._archive_mtime = mtime | ||
|
||
try: | ||
files = _zip_directory_cache[self.archive] | ||
except KeyError: | ||
files = _read_directory(self.archive) | ||
_zip_directory_cache[self.archive] = files | ||
self._archive_mtime = mtime | ||
|
||
self._files = files | ||
return self._files | ||
|
||
def invalidate_caches(self): | ||
"""Invalidate the mtime of the ZIP archive.""" | ||
self._archive_mtime = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to add the stat call on every access of the importer instead of keeping the cache as-is and instead rebuilding it when invalidate_caches()
is called (i.e. make invalidate_caches()
take over the part of __init__()
that sets self._files
)? Since the common case is going to be no file is changed, I think optimizing performance that way to avoid the stat call makes sense.
You also didn't remove self._files
, which led to two different sets of file data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I was following the pattern in FileFinder.find_spec()
where the cache is refreshed whenever the directory that a FileFinder is handling has been modified. Maybe it isn't so applicable because a zip file isn't modified as often.
If we're removing this check for zip file modification and consolidating setting self._files
in invalidate_caches()
, then it also makes sense to revert _get_files()
back to self._files
since no additional checks are being done on accessing file data. Keeping track of the zip file's mtime probably isn't needed anymore either since a user presumably always wants to call _read_directory()
if invalidate_caches()
is being called.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
This change also reverts the new _get_files() method back to the _files private attribute, and removes the addition of the _archive_mtime attribute.
I have made the requested changes; please review again |
Thanks for making the requested changes! @brettcannon: please review the changes made to this pull request. |
@desmondcheongzx I just wanted to say that I do plan to get back to reviewing this PR; just swamped with stuff. |
@brettcannon I completely understand, thank you for your time and attention :) |
@desmondcheongzx LGTM! I suggested adding yourself to the news entry. Otherwise if you can update the bytecode I |
…alidate-caches-to-zipimporter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor nitpick, but otherwise, looks good to me.
@brettcannon thank you for the feedback! I've updated the news entry and the binary for @jaraco I've done as suggested and moved the test for |
Thanks, @desmondcheongzx ! |
def invalidate_caches(self): | ||
"""Reload the file data of the archive path.""" | ||
try: | ||
self._files = _read_directory(self.archive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @desmondcheongzx. I do appreciate this fix.
Just dropping a comment in case the impact of this was missed during the review.
This might impact the performance notably. For example, I only import two zipped packages py4j
and pyspark
(and of course I have many packages installed from pip, and import them within my code) but it slows down the speed of importlib.invalidate_caches
up to 3500% because now it has to go through the zipped package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was missed unfortunately.
To get everyone back up to speed, in the original discussion there were two decisions made:
- The
_get_files()
method should be as fast as possible in the common case where zip files are not changed. This is why we avoided a stat call on every access of the importer. - Rebuild the cache when
invalidate_caches()
is called. This second decision might not be the right one, because clearing a cache does not mean that we want to immediately repopulate it.
There are three relevant events:
_get_files()
is called with a valid cacheinvalidate_caches
is called_get_files()
is called with an invalid cache
IMO, what we really want is for events (1) and (2) to be fast, while event (3) can be slow since we're repopulating a cache. (Note that in the original PR we made (1) and (3) fast, but (2) slow). So I propose the following:
- Add a boolean flag
cache_is_valid
that is set to false wheninvalidate_caches()
is called. - In
_get_files()
, ifcache_is_valid
is true, use the cache. Ifvalid_cache
is false, call_read_directory()
.
This approach avoids any behaviour change and keeps the common path performant, while also shifting the cost of reading the directory out of cache invalidation.
If we want to go further and consider the fact that we rarely expect zip archives to change, then we can also consider adding a flag to importlib.invalidate_caches
so that users can choose whether zip caches are invalidated.
Any thoughts, @brettcannon?
Added an invalidate_caches() method to the zipimport.zipimporter class based on the implementation of importlib.FileFinder.invalidate_caches(). This was done by adding a get_files() method and an _archive_mtime attribute to zipimport.zipimporter to check for updates or cache invalidation whenever the cache of files and toc entry information in the zipimporter is accessed.
https://bugs.python.org/issue14678
Automerge-Triggered-By: GH:brettcannon