Description
We have recently seen a lot of proposals for new storage classes in zarr (e.g. #299, #294, #293, #252). These are all great ideas. At the same time, we have several working storage layers (s3fs, gcsfs) that don't live inside zarr because they already provide a MutableMapping interface that zarr can talk to. The situation is fragmented, and we don't seem to have a clear roadmap for how to handle all these different scenarios. There is some relevant discussion in #290.
I recently learned about pyfilesystem: "PyFilesystem is a Python module that provides a common interface to any filesystem." The index of supported filesystems provides analogs for nearly all of the builtin zarr storage options. Plus there are storage classes for cloud, ftp, dropbox, etc.
Perhaps one path forward would be to refactor zarr's storage to use pyfilesystem objects. We would only really need a single storage class which wraps pyfilesystem and provides the MutableMapping that zarr uses internally. Then we could remove the 80% of `storage.py` that deals with listing directories, zip files, etc., since this would be handled by pyfilesystem.
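To make the idea concrete, here is a minimal sketch of what such a wrapper might look like. The names `FSMap` and `InMemoryFS` are hypothetical and not part of zarr or pyfilesystem. The wrapper assumes the wrapped object offers PyFilesystem2-style methods (`readbytes`, `writebytes`, `remove`, `isfile`, `makedirs`, `walk.files`); `InMemoryFS` is only a stand-in for those calls so the example runs without any third-party package.

```python
from collections.abc import MutableMapping
import posixpath


class FSMap(MutableMapping):
    """Hypothetical adapter: expose a filesystem-like object as key -> bytes.

    `fs` is assumed to provide the PyFilesystem2-style methods used below.
    """

    def __init__(self, fs):
        self.fs = fs

    def __getitem__(self, key):
        if not self.fs.isfile(key):
            raise KeyError(key)
        return self.fs.readbytes(key)

    def __setitem__(self, key, value):
        # Create parent directories first, as a real filesystem would require.
        parent = posixpath.dirname(key)
        if parent:
            self.fs.makedirs(parent, recreate=True)
        self.fs.writebytes(key, bytes(value))

    def __delitem__(self, key):
        if not self.fs.isfile(key):
            raise KeyError(key)
        self.fs.remove(key)

    def __iter__(self):
        # walk.files() is assumed to yield absolute paths such as "/foo/0.0".
        for path in self.fs.walk.files():
            yield path.lstrip("/")

    def __len__(self):
        return sum(1 for _ in self)


class _Walker:
    """Stand-in for the `walk` attribute of a filesystem object."""

    def __init__(self, files):
        self._files = files

    def files(self):
        return ["/" + path for path in sorted(self._files)]


class InMemoryFS:
    """Stand-in filesystem for illustration only; directories are implicit."""

    def __init__(self):
        self._files = {}
        self.walk = _Walker(self._files)

    def isfile(self, path):
        return path.strip("/") in self._files

    def readbytes(self, path):
        return self._files[path.strip("/")]

    def writebytes(self, path, contents):
        self._files[path.strip("/")] = contents

    def remove(self, path):
        del self._files[path.strip("/")]

    def makedirs(self, path, recreate=False):
        pass  # nothing to do: this stand-in has no real directories


store = FSMap(InMemoryFS())
store["foo/.zarray"] = b"{}"
store["foo/0.0"] = b"\x00"
```

Since zarr only needs a MutableMapping, an object like `store` could in principle be handed to zarr in place of a built-in store, with the listing and directory handling delegated to the filesystem object.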
Once we had a generic filesystem, we could then create a Layout layer, which describes how the zarr objects are laid out within the filesystem. For example, today, we already have two de facto layouts: `DirectoryStore` and `NestedDirectoryStore`. We could consider others, for example one with all the metadata in a single file (e.g. #294). The Layout and the Filesystem could be independent of one another.
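As a rough sketch of what a Layout could mean in code: a layout might be nothing more than an object that turns an array path and a chunk index into a storage key. The class names `FlatLayout` and `NestedLayout` and the `chunk_key` method are hypothetical; they mirror the two existing conventions, where chunk `(0, 1)` is stored as `0.1` in a flat directory and as `0/1` in a nested one.

```python
class FlatLayout:
    """Hypothetical layout: chunk (0, 1) of array "foo" -> "foo/0.1"."""

    separator = "."

    def chunk_key(self, array_path, chunk_index):
        name = self.separator.join(str(i) for i in chunk_index)
        # Arrays at the root of the store have an empty path.
        return f"{array_path}/{name}" if array_path else name


class NestedLayout(FlatLayout):
    """Hypothetical layout: chunk (0, 1) of array "foo" -> "foo/0/1"."""

    separator = "/"


flat_key = FlatLayout().chunk_key("foo/bar", (0, 1))
nested_key = NestedLayout().chunk_key("foo/bar", (0, 1))
```

Because a layout of this kind only produces keys, it would not need to know which filesystem sits underneath, which is the independence described above.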
For new storage layers like mongodb, redis, etc., we would basically just say, "go implement a pyfilesystem for that". This has two advantages:
- reducing the maintenance burden in zarr
- providing more general filesystem objects (that can also be used outside of zarr)
The only con I can think of is performance: it is possible that the pyfilesystem implementations could have worse performance than the zarr built-in ones. But this cuts both ways: they could also perform better!
I know this implies a fairly big refactor of zarr. But it could save us lots of headaches in the long run.