Skip to content

bpo-14678: Update zipimport to support importlib.invalidate_caches() #24159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

desmondcheongzx
Copy link
Contributor

@desmondcheongzx desmondcheongzx commented Jan 8, 2021

Added an invalidate_caches() method to the zipimport.zipimporter class based on the implementation of importlib.FileFinder.invalidate_caches(). This was done by adding a get_files() method and an _archive_mtime attribute to zipimport.zipimporter to check for updates or cache invalidation whenever the cache of files and toc entry information in the zipimporter is accessed.

https://bugs.python.org/issue14678

Automerge-Triggered-By: GH:brettcannon

Functions in zipimport and zipimporter methods used to access the
files and toc entries stored in a ZIP archive via zipimporter's
private _files attribute. However, this makes it hard to check
if the ZIP archive had been updated, or if the cached information
has been invalidated whenever _files is accessed.

Hence, this change adds get_files() as an internal accessor for the
_files attribute. Whenever get_files() is called, we do a check to
make sure that our cached information is up-to-date. This check is
done using a new _archive_mtime attribute that stores the last
modified timestamp of the ZIP archive whenever
zipimport._read_directory is called.
Zipimport's zipimporter are the finders that get cached in
sys.path_importer_cache, so this addition enables zipimport to
support importlib.invalidate_caches().
@brettcannon
Copy link
Member

I haven't looked at the PR, but why the addition of a new get_files() public method? Couldn't it have been kept private so as to not unnecessarily expand the public API?

@desmondcheongzx
Copy link
Contributor Author

Ah I was ambivalent about making get_files() public too, so I took the lead of get_resource_reader(), which I saw was a public but undocumented method. But I do agree that it could be made private because it replaces references to the _files private attribute. I'll be more than happy to make the change if this is appropriate.

@brettcannon
Copy link
Member

@desmondcheongzx get_resource_reader() is public as it's necessary for importlib.metadata.

@desmondcheongzx
Copy link
Contributor Author

I see, I hadn't looked through that part of importlib. Going through the docs for importlib more closely, it seems that get_resource_reader() has indeed been documented in multiple places (e.g.). Thanks @brettcannon for pointing me towards the right context.

I've renamed get_files() to a private _get_files() method as suggested.

@brettcannon brettcannon requested a review from jaraco February 5, 2021 22:08
Copy link
Member

@brettcannon brettcannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense to consolidate that code that sets self._files in invalidate_caches() and call it from __init__() to avoid having to make so many stat calls.

Lib/zipimport.py Outdated
Comment on lines 324 to 346
def _get_files(self):
"""Return the files within the archive path."""
try:
mtime = _bootstrap_external._path_stat(self.archive).st_mtime
except OSError:
mtime = -1
if mtime != self._archive_mtime:
_zip_directory_cache[self.archive] = _read_directory(self.archive)
self._archive_mtime = mtime

try:
files = _zip_directory_cache[self.archive]
except KeyError:
files = _read_directory(self.archive)
_zip_directory_cache[self.archive] = files
self._archive_mtime = mtime

self._files = files
return self._files

def invalidate_caches(self):
"""Invalidate the mtime of the ZIP archive."""
self._archive_mtime = -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason to add the stat call on every access of the importer instead of keeping the cache as-is and instead rebuilding it when invalidate_caches() is called (i.e. make invalidate_caches() take over the part of __init__() that sets self._files)? Since the common case is going to be no file is changed, I think optimizing performance that way to avoid the stat call makes sense.

You also didn't remove self._files, which led to two different sets of file data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I was following the pattern in FileFinder.find_spec() where the cache is refreshed whenever the directory that a FileFinder is handling has been modified. Maybe it isn't so applicable because a zip file isn't modified as often.

If we're removing this check for zip file modification and consolidating setting self._files in invalidate_caches(), then it also makes sense to revert _get_files() back to self._files since no additional checks are being done on accessing file data. Keeping track of the zip file's mtime probably isn't needed anymore either since a user presumably always wants to call _read_directory() if invalidate_caches() is being called.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

This change also reverts the new _get_files() method back to the
_files private attribute, and removes the addition of the
_archive_mtime attribute.
@desmondcheongzx
Copy link
Contributor Author

I have made the requested changes; please review again

@bedevere-bot
Copy link

Thanks for making the requested changes!

@brettcannon: please review the changes made to this pull request.

@brettcannon
Copy link
Member

@desmondcheongzx I just wanted to say that I do plan to get back to reviewing this PR; just swamped with stuff.

@desmondcheongzx
Copy link
Contributor Author

@brettcannon I completely understand, thank you for your time and attention :)

@brettcannon
Copy link
Member

brettcannon commented Mar 5, 2021

@desmondcheongzx LGTM! I suggested adding yourself to the news entry. Otherwise if you can update the bytecode I importlib._zipimport.h I will then merge this!

Copy link
Member

@jaraco jaraco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor nitpick, but otherwise, looks good to me.

@desmondcheongzx
Copy link
Contributor Author

@brettcannon thank you for the feedback! I've updated the news entry and the binary for importlib._zipimport.h

@jaraco I've done as suggested and moved the test for invalidate_caches into a separate method

@miss-islington miss-islington merged commit 3abf6f0 into python:master Mar 8, 2021
@brettcannon
Copy link
Member

Thanks, @desmondcheongzx !

def invalidate_caches(self):
"""Reload the file data of the archive path."""
try:
self._files = _read_directory(self.archive)
Copy link

@HyukjinKwon HyukjinKwon Mar 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @desmondcheongzx. I do appreciate this fix.

Just dropping a comment in case the impact of this was missed during the review.
This might impact the performance notably. For example, I only import two zipped packages py4j and pyspark (and of course I have many packages installed from pip, and import them within my code) but it slows down the speed of importlib.invalidate_caches up to 3500% because now it has to go through the zipped package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was missed unfortunately.

To get everyone back up to speed, in the original discussion there were two decisions made:

  1. The _get_files() method should be as fast as possible in the common case where zip files are not changed. This is why we avoided a stat call on every access of the importer.
  2. Rebuild the cache when invalidate_caches() is called. This second decision might not be the right one, because clearing a cache does not mean that we want to immediately repopulate it.

There are three relevant events:

  1. _get_files() is called with a valid cache
  2. invalidate_caches is called
  3. _get_files() is called with an invalid cache

IMO, what we really want is for events (1) and (2) to be fast, while event (3) can be slow since we're repopulating a cache. (Note that in the original PR we made (1) and (3) fast, but (2) slow). So I propose the following:

  • Add a boolean flag cache_is_valid that is set to false when invalidate_caches() is called.
  • In _get_files(), if cache_is_valid is true, use the cache. If valid_cache is false, call _read_directory().

This approach avoids any behaviour change and keeps the common path performant, while also shifting the cost of reading the directory out of cache invalidation.

If we want to go further and consider the fact that we rarely expect zip archives to change, then we can also consider adding a flag to importlib.invalidate_caches so that users can choose whether zip caches are invalidated.

Any thoughts, @brettcannon?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants