bpo-14678: Update zipimport to support importlib.invalidate_caches() #24159

desmondcheongzx · 2021-01-08T02:38:00Z

Added an invalidate_caches() method to the zipimport.zipimporter class based on the implementation of importlib.FileFinder.invalidate_caches(). This was done by adding a get_files() method and an _archive_mtime attribute to zipimport.zipimporter to check for updates or cache invalidation whenever the cache of files and toc entry information in the zipimporter is accessed.

https://bugs.python.org/issue14678

Automerge-Triggered-By: GH:brettcannon

Functions in zipimport and zipimporter methods used to access the files and toc entries stored in a ZIP archive via zipimporter's private _files attribute. However, this makes it hard to check if the ZIP archive had been updated, or if the cached information has been invalidated whenever _files is accessed. Hence, this change adds get_files() as an internal accessor for the _files attribute. Whenever get_files() is called, we do a check to make sure that our cached information is up-to-date. This check is done using a new _archive_mtime attribute that stores the last modified timestamp of the ZIP archive whenever zipimport._read_directory is called.

Zipimport's zipimporter are the finders that get cached in sys.path_importer_cache, so this addition enables zipimport to support importlib.invalidate_caches().

brettcannon · 2021-01-08T17:53:13Z

I haven't looked at the PR, but why the addition of a new get_files() public method? Couldn't it have been kept private so as to not unnecessarily expand the public API?

desmondcheongzx · 2021-01-08T19:36:18Z

Ah I was ambivalent about making get_files() public too, so I took the lead of get_resource_reader(), which I saw was a public but undocumented method. But I do agree that it could be made private because it replaces references to the _files private attribute. I'll be more than happy to make the change if this is appropriate.

brettcannon · 2021-01-11T20:03:34Z

@desmondcheongzx get_resource_reader() is public as it's necessary for importlib.metadata.

desmondcheongzx · 2021-01-11T22:57:27Z

I see, I hadn't looked through that part of importlib. Going through the docs for importlib more closely, it seems that get_resource_reader() has indeed been documented in multiple places (e.g.). Thanks @brettcannon for pointing me towards the right context.

I've renamed get_files() to a private _get_files() method as suggested.

brettcannon

I think it would make more sense to consolidate that code that sets self._files in invalidate_caches() and call it from __init__() to avoid having to make so many stat calls.

brettcannon · 2021-02-05T22:15:27Z

Lib/zipimport.py

+    def _get_files(self):
+        """Return the files within the archive path."""
+        try:
+            mtime = _bootstrap_external._path_stat(self.archive).st_mtime
+        except OSError:
+            mtime = -1
+        if mtime != self._archive_mtime:
+            _zip_directory_cache[self.archive] = _read_directory(self.archive)
+            self._archive_mtime = mtime
+
+        try:
+            files = _zip_directory_cache[self.archive]
+        except KeyError:
+            files = _read_directory(self.archive)
+            _zip_directory_cache[self.archive] = files
+            self._archive_mtime = mtime
+
+        self._files = files
+        return self._files
+
+    def invalidate_caches(self):
+        """Invalidate the mtime of the ZIP archive."""
+        self._archive_mtime = -1


Is there a specific reason to add the stat call on every access of the importer instead of keeping the cache as-is and instead rebuilding it when invalidate_caches() is called (i.e. make invalidate_caches() take over the part of __init__() that sets self._files)? Since the common case is going to be no file is changed, I think optimizing performance that way to avoid the stat call makes sense.

You also didn't remove self._files, which led to two different sets of file data.

In this case I was following the pattern in FileFinder.find_spec() where the cache is refreshed whenever the directory that a FileFinder is handling has been modified. Maybe it isn't so applicable because a zip file isn't modified as often.

If we're removing this check for zip file modification and consolidating setting self._files in invalidate_caches(), then it also makes sense to revert _get_files() back to self._files since no additional checks are being done on accessing file data. Keeping track of the zip file's mtime probably isn't needed anymore either since a user presumably always wants to call _read_directory() if invalidate_caches() is being called.

bedevere-bot · 2021-02-05T22:16:10Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

This change also reverts the new _get_files() method back to the _files private attribute, and removes the addition of the _archive_mtime attribute.

desmondcheongzx · 2021-02-06T03:30:31Z

I have made the requested changes; please review again

bedevere-bot · 2021-02-06T03:30:34Z

Thanks for making the requested changes!

@brettcannon: please review the changes made to this pull request.

brettcannon · 2021-02-26T19:54:35Z

@desmondcheongzx I just wanted to say that I do plan to get back to reviewing this PR; just swamped with stuff.

desmondcheongzx · 2021-02-26T21:13:01Z

@brettcannon I completely understand, thank you for your time and attention :)

Misc/NEWS.d/next/Library/2021-01-07-21-25-49.bpo-14678.1zniCH.rst

brettcannon · 2021-03-05T21:30:01Z

@desmondcheongzx LGTM! I suggested adding yourself to the news entry. Otherwise if you can update the bytecode I importlib._zipimport.h I will then merge this!

…alidate-caches-to-zipimporter

jaraco

One minor nitpick, but otherwise, looks good to me.

Lib/test/test_zipimport.py

desmondcheongzx · 2021-03-06T04:20:00Z

@brettcannon thank you for the feedback! I've updated the news entry and the binary for importlib._zipimport.h

@jaraco I've done as suggested and moved the test for invalidate_caches into a separate method

brettcannon · 2021-03-08T20:06:23Z

Thanks, @desmondcheongzx !

HyukjinKwon · 2023-03-31T04:43:49Z

Lib/zipimport.py

+    def invalidate_caches(self):
+        """Reload the file data of the archive path."""
+        try:
+            self._files = _read_directory(self.archive)


Thanks, @desmondcheongzx. I do appreciate this fix.

Just dropping a comment in case the impact of this was missed during the review.
This might impact the performance notably. For example, I only import two zipped packages py4j and pyspark (and of course I have many packages installed from pip, and import them within my code) but it slows down the speed of importlib.invalidate_caches up to 3500% because now it has to go through the zipped package.

This was missed unfortunately.

To get everyone back up to speed, in the original discussion there were two decisions made:

The _get_files() method should be as fast as possible in the common case where zip files are not changed. This is why we avoided a stat call on every access of the importer.

Rebuild the cache when invalidate_caches() is called. This second decision might not be the right one, because clearing a cache does not mean that we want to immediately repopulate it.

There are three relevant events:

_get_files() is called with a valid cache

invalidate_caches is called

_get_files() is called with an invalid cache

IMO, what we really want is for events (1) and (2) to be fast, while event (3) can be slow since we're repopulating a cache. (Note that in the original PR we made (1) and (3) fast, but (2) slow). So I propose the following:

Add a boolean flag cache_is_valid that is set to false when invalidate_caches() is called.

In _get_files(), if cache_is_valid is true, use the cache. If valid_cache is false, call _read_directory().

This approach avoids any behaviour change and keeps the common path performant, while also shifting the cost of reading the directory out of cache invalidation.

If we want to go further and consider the fact that we rarely expect zip archives to change, then we can also consider adding a flag to importlib.invalidate_caches so that users can choose whether zip caches are invalidated.

Any thoughts, @brettcannon?

desmondcheongzx added 3 commits January 7, 2021 19:45

Add an invalidate_caches() method to zipimporter

4e99b50

Zipimport's zipimporter are the finders that get cached in sys.path_importer_cache, so this addition enables zipimport to support importlib.invalidate_caches().

Add a news entry

1bbd671

desmondcheongzx requested review from brettcannon, encukou, ericsnowcurrently, ncoghlan and warsaw as code owners January 8, 2021 02:38

the-knights-who-say-ni added the CLA signed label Jan 8, 2021

bedevere-bot added the awaiting review label Jan 8, 2021

Add an additional test for zipimporter.invalidate_caches()

e51cb08

Rename get_files() to _get_files()

ccb6c70

brettcannon requested a review from jaraco February 5, 2021 22:08

brettcannon requested changes Feb 5, 2021

View reviewed changes

bedevere-bot added awaiting changes and removed awaiting review labels Feb 5, 2021

desmondcheongzx added 2 commits February 5, 2021 22:04

Consolidate updates of _files into invalidate_caches()

ffca660

This change also reverts the new _get_files() method back to the _files private attribute, and removes the addition of the _archive_mtime attribute.

Remove stray newline

5ce2ada

bedevere-bot added awaiting change review and removed awaiting changes labels Feb 6, 2021

bedevere-bot requested a review from brettcannon February 6, 2021 03:30

brettcannon closed this Feb 19, 2021

brettcannon reopened this Feb 19, 2021

brettcannon approved these changes Mar 5, 2021

View reviewed changes

Misc/NEWS.d/next/Library/2021-01-07-21-25-49.bpo-14678.1zniCH.rst Show resolved Hide resolved

bedevere-bot added awaiting merge and removed awaiting change review labels Mar 5, 2021

desmondcheongzx added 2 commits March 5, 2021 18:15

Add author to news entry

d7302fb

Merge remote-tracking branch 'upstream/master' into bpo-14678-add-inv…

b33b9dc

…alidate-caches-to-zipimporter

jaraco approved these changes Mar 6, 2021

View reviewed changes

Lib/test/test_zipimport.py Outdated Show resolved Hide resolved

desmondcheongzx added 2 commits March 5, 2021 22:35

Move tests for invalidate_caches into a new method

bf52347

Update bytecode for importlib._zipimport.h

64d9637

brettcannon added the 🤖 automerge label Mar 8, 2021

miss-islington merged commit 3abf6f0 into python:master Mar 8, 2021

bedevere-bot removed the awaiting merge label Mar 8, 2021

HyukjinKwon reviewed Mar 31, 2023

View reviewed changes

desmondcheongzx mentioned this pull request Apr 2, 2023

zipimport.invalidate_caches() implementation causes performance regression for importlib.invalidate_caches() #103200

Closed

Uh oh!

bpo-14678: Update zipimport to support importlib.invalidate_caches() #24159

bpo-14678: Update zipimport to support importlib.invalidate_caches() #24159

Uh oh!

Conversation

desmondcheongzx commented Jan 8, 2021 • edited by miss-islington Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brettcannon commented Jan 8, 2021

Uh oh!

desmondcheongzx commented Jan 8, 2021

Uh oh!

brettcannon commented Jan 11, 2021

Uh oh!

desmondcheongzx commented Jan 11, 2021

Uh oh!

brettcannon left a comment

Choose a reason for hiding this comment

Uh oh!

brettcannon Feb 5, 2021

Choose a reason for hiding this comment

Uh oh!

desmondcheongzx Feb 5, 2021

Choose a reason for hiding this comment

Uh oh!

bedevere-bot commented Feb 5, 2021

Uh oh!

desmondcheongzx commented Feb 6, 2021

Uh oh!

bedevere-bot commented Feb 6, 2021

Uh oh!

brettcannon commented Feb 26, 2021

Uh oh!

desmondcheongzx commented Feb 26, 2021

Uh oh!

Uh oh!

brettcannon commented Mar 5, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jaraco left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

desmondcheongzx commented Mar 6, 2021

Uh oh!

brettcannon commented Mar 8, 2021

Uh oh!

HyukjinKwon Mar 31, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

desmondcheongzx Mar 31, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

desmondcheongzx commented Jan 8, 2021 •

edited by miss-islington

Loading

brettcannon commented Mar 5, 2021 •

edited

Loading

HyukjinKwon Mar 31, 2023 •

edited

Loading