Description
Opening an issue here for a discussion topic that came up today: What is the optimal chunking we should have for our files.
Thanks to @tcompa 's work on improved chunking here, we now have the ability to rechunk zarr files and to save them with desired chunk sizes at all pyramid levels, see here: fractal-analytics-platform/fractal-client#32
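As a rough illustration of what this enables (not the actual Fractal implementation; the plate path, the component name, and the chunk shape below are hypothetical placeholders), rechunking a single pyramid level with dask could look roughly like this:

```python
# Hedged sketch: rechunk one pyramid level of an OME-Zarr with dask.
# The zarr path, the component "B/03/0/0", and the chunk shape are
# made-up placeholders, not Fractal's actual layout or defaults.
import dask.array as da

zarr_url = "/path/to/plate.zarr"                     # hypothetical path
level = da.from_zarr(zarr_url, component="B/03/0/0")

# Pick a chunk shape to experiment with, e.g. one full xy plane per chunk.
rechunked = level.rechunk((1, 2160, 2560))

# Write it back to a new component, to avoid clobbering the original level.
rechunked.to_zarr(zarr_url, component="B/03/0/0_rechunked", overwrite=True)
```

Having this in place makes it cheap to experiment with different chunk shapes per pyramid level and measure the effects discussed below.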
What is the optimal chunk size we should use for our files?
There are two main concerns that may align for a while, but may eventually turn into a trade-off.
Concern 1)
Interactive accessibility of the data.
Specifically, how easily can we read in just the chunk of data we are interested in, and what kind of access patterns will we have? (E.g. at the moment we are loosely optimizing for displaying xy planes of the data, but not for displaying xz planes.) Also, what is the optimal chunk size for visualization? Increasing chunk sizes for the lower-resolution pyramid levels helped a lot with interactive performance. Would further increases still help, or start to hinder performance?
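To make the access-pattern point concrete: if chunks are laid out as full xy planes, an xy slice touches a single chunk, while an xz slice has to read and decompress every z chunk. A minimal sketch, with an invented array shape and chunk shape:

```python
# Hedged sketch: how chunk layout maps to slicing patterns, assuming a
# hypothetical 3D stack chunked as one full xy plane per chunk.
import zarr

z = zarr.open(
    "example.zarr", mode="w",
    shape=(40, 2160, 2560),    # (z, y, x), made-up stack size
    chunks=(1, 2160, 2560),    # one full xy plane per chunk
    dtype="uint16",
)
z[:] = 0  # write some data so all 40 chunk files exist on disk

xy_plane = z[0, :, :]   # reads exactly 1 chunk
xz_plane = z[:, 0, :]   # reads and decompresses all 40 chunks for one row each
```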
Concern 2)
Number of folders and files saved to the filesystem
I'm no filesystem expert, but we've already started noticing that copying zarr files is quite slow because of the very nested structure. And IT at the FMI became worried about file counts today when we presented our plans for Fractal and OME-Zarr usage, specifically about the number of files that would need to be saved. Their preference would be that we end up with equal or fewer files than the raw data.
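The number of chunk files scales with the number of chunks per array, summed over all pyramid levels (and then over wells, channels, etc.). A back-of-the-envelope sketch, with all numbers invented:

```python
# Hedged sketch: count chunk files written for one array across a pyramid.
# Image shape, chunk shape, and number of levels are made-up examples.
import math

def n_chunk_files(shape, chunks):
    """Number of chunk files zarr writes for a fully populated array."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

shape = (1, 19440, 20480)        # hypothetical stitched well, (z, y, x)
chunks = (1, 2160, 2560)         # hypothetical chunk shape, kept fixed per level

total = 0
for level in range(5):           # 5 pyramid levels, coarsening xy by 2 each time
    level_shape = (shape[0], shape[1] // 2**level, shape[2] // 2**level)
    files = n_chunk_files(level_shape, chunks)
    total += files
    print(f"level {level}: {files} chunk files")

print(f"total (plus .zarray/.zattrs metadata files): {total}")
```

Larger chunks reduce the file count roughly in proportion, which is one reason the two concerns can pull in the same direction up to a point.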
Both concerns are things we should explore further to find good tradeoffs and exchange with e.g. the OME-Zarr developers about how they handle this on their side.
Also, Zarr has support for reading from zip stores directly (=> would just be 1 file on disk, right?), see here: https://zarr.readthedocs.io/en/stable/tutorial.html#storage-alternatives
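For reference, a minimal sketch of reading through a zip store with zarr's API (the file name and hierarchy are placeholders):

```python
# Hedged sketch: open an OME-Zarr packed into a single zip file.
# "plate.zarr.zip" is a hypothetical archive created from a zarr hierarchy.
import zarr

store = zarr.ZipStore("plate.zarr.zip", mode="r")
group = zarr.open_group(store, mode="r")
print(group.tree())   # browse the hierarchy without unpacking anything
store.close()
```

One caveat: zip stores are essentially write-once (existing entries cannot be rewritten in place), so they would fit better as an archival/export format than as a working store during processing.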
Additionally, there are cloud storage approaches for Zarrs for AWS, Google Cloud Storage etc. that may be worth having a look at.
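For the cloud option, zarr can read through fsspec-backed mappings; a rough sketch with s3fs (the bucket, key, and anonymous access are made-up placeholders):

```python
# Hedged sketch: read a zarr hierarchy directly from S3 via s3fs.
# Bucket name, key, and anonymous access are hypothetical.
import s3fs
import zarr

fs = s3fs.S3FileSystem(anon=True)
store = s3fs.S3Map(root="my-bucket/plate.zarr", s3=fs, check=False)
group = zarr.open_group(store, mode="r")
print(group.tree())
```

This sidesteps the local small-files problem, though it moves it to object storage, where per-object request overhead becomes the analogous cost to watch.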