bpo-14678: Update zipimport to support importlib.invalidate_caches() #24159

9 changes: 9 additions & 0 deletions Doc/library/zipimport.rst
@@ -166,6 +166,15 @@ zipimporter Objects

Use :meth:`exec_module` instead.


.. method:: invalidate_caches()

   Clear out the internal cache of information about files found within
   the ZIP archive.

   .. versionadded:: 3.10


.. attribute:: archive

The file name of the importer's associated ZIP file, without a possible
41 changes: 41 additions & 0 deletions Lib/test/test_zipimport.py
@@ -506,6 +506,47 @@ def testZipImporterMethods(self):
self.assertEqual(zi2.archive, TEMP_ZIP)
self.assertEqual(zi2.prefix, TESTPACK + os.sep)

    def testInvalidateCaches(self):
        packdir = TESTPACK + os.sep
        packdir2 = packdir + TESTPACK2 + os.sep
        files = {packdir + "__init__" + pyc_ext: (NOW, test_pyc),
                 packdir2 + "__init__" + pyc_ext: (NOW, test_pyc),
                 packdir2 + TESTMOD + pyc_ext: (NOW, test_pyc),
                 "spam" + pyc_ext: (NOW, test_pyc)}
        self.addCleanup(os_helper.unlink, TEMP_ZIP)
        with ZipFile(TEMP_ZIP, "w") as z:
            for name, (mtime, data) in files.items():
                zinfo = ZipInfo(name, time.localtime(mtime))
                zinfo.compress_type = self.compression
                zinfo.comment = b"spam"
                z.writestr(zinfo, data)

        zi = zipimport.zipimporter(TEMP_ZIP)
        self.assertEqual(zi._files.keys(), files.keys())
        # Check that the file information remains accurate after reloading
        zi.invalidate_caches()
        self.assertEqual(zi._files.keys(), files.keys())
        # Add a new file to the ZIP archive
        newfile = {"spam2" + pyc_ext: (NOW, test_pyc)}
        files.update(newfile)
        with ZipFile(TEMP_ZIP, "a") as z:
            for name, (mtime, data) in newfile.items():
                zinfo = ZipInfo(name, time.localtime(mtime))
                zinfo.compress_type = self.compression
                zinfo.comment = b"spam"
                z.writestr(zinfo, data)
        # Check that we can detect the new file after invalidating the cache
        zi.invalidate_caches()
        self.assertEqual(zi._files.keys(), files.keys())
        spec = zi.find_spec('spam2')
        self.assertIsNotNone(spec)
        self.assertIsInstance(spec.loader, zipimport.zipimporter)
        # Check that the cached data is removed if the file is deleted
        os.remove(TEMP_ZIP)
        zi.invalidate_caches()
        self.assertIsNone(zi._files)
        self.assertIsNone(zipimport._zip_directory_cache.get(zi.archive))

def testZipImporterMethodsInSubDirectory(self):
packdir = TESTPACK + os.sep
packdir2 = packdir + TESTPACK2 + os.sep
10 changes: 10 additions & 0 deletions Lib/zipimport.py
@@ -321,6 +321,16 @@ def get_resource_reader(self, fullname):
return ZipReader(self, fullname)


    def invalidate_caches(self):
        """Reload the file data of the archive path."""
        try:
            self._files = _read_directory(self.archive)
@HyukjinKwon commented on Mar 31, 2023

Thanks, @desmondcheongzx. I do appreciate this fix.

Just dropping a comment in case the impact of this was missed during the review.
This change can affect performance noticeably. For example, I import only two zipped packages, py4j and pyspark (and of course I have many packages installed from pip and import them within my code), yet it slows importlib.invalidate_caches down by up to 3500%, because every call now has to re-read the directory of each zipped package.
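To make the kind of regression described here concrete, the following is a rough, self-contained micro-benchmark sketch (the archive contents and entry counts are made up for illustration, and absolute timings will vary by machine and Python version). It times importlib.invalidate_caches() before and after a zipimporter lands in sys.path_importer_cache:

```python
import importlib
import os
import sys
import tempfile
import timeit
import zipfile

# Build a throwaway archive with a few hundred entries, roughly like a
# real zipped package (module names here are invented for the sketch).
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "pkg.zip")
with zipfile.ZipFile(archive, "w") as z:
    for i in range(300):
        z.writestr(f"mod{i}.py", "x = 1\n")

# Time invalidate_caches() before any zipimporter is on the path...
before = timeit.timeit(importlib.invalidate_caches, number=50)

# ...then put the archive on sys.path and import from it, so a
# zipimporter is created and cached in sys.path_importer_cache.
sys.path.insert(0, archive)
importlib.import_module("mod0")

# Now every invalidate_caches() call also reaches the zipimporter.
after = timeit.timeit(importlib.invalidate_caches, number=50)
print(f"without zipimporter: {before:.4f}s  with zipimporter: {after:.4f}s")
```

On 3.10/3.11, the second measurement includes re-reading the ZIP directory on every call, which is where the reported slowdown comes from.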

desmondcheongzx (Contributor Author) replied:

This was missed unfortunately.

To get everyone back up to speed, in the original discussion there were two decisions made:

  1. The _get_files() method should be as fast as possible in the common case where zip files are not changed. This is why we avoided a stat call on every access of the importer.
  2. Rebuild the cache when invalidate_caches() is called. This second decision might not be the right one, because clearing a cache does not mean that we want to immediately repopulate it.

There are three relevant events:

  1. _get_files() is called with a valid cache
  2. invalidate_caches is called
  3. _get_files() is called with an invalid cache

IMO, what we really want is for events (1) and (2) to be fast, while event (3) can be slow since we're repopulating a cache. (Note that in the original PR we made (1) and (3) fast, but (2) slow). So I propose the following:

  • Add a boolean flag cache_is_valid that is set to false when invalidate_caches() is called.
  • In _get_files(), if cache_is_valid is true, use the cache. If cache_is_valid is false, call _read_directory() to rebuild the cache and set the flag back to true.

This approach avoids any behaviour change and keeps the common path performant, while also shifting the cost of reading the directory out of cache invalidation.

If we want to go further and consider the fact that we rarely expect zip archives to change, then we can also consider adding a flag to importlib.invalidate_caches so that users can choose whether zip caches are invalidated.

Any thoughts, @brettcannon?

            _zip_directory_cache[self.archive] = self._files
        except ZipImportError:
            _zip_directory_cache.pop(self.archive, None)
            self._files = None


def __repr__(self):
return f'<zipimporter object "{self.archive}{path_sep}{self.prefix}">'

Expand Down
@@ -0,0 +1,3 @@
Add an invalidate_caches() method to the zipimport.zipimporter class to
support importlib.invalidate_caches().
Patch by Desmond Cheong.