Make xarray an optional dependency? #521

@TomNicholas

I realised that the in-progress ManifestStore refactor would separate concerns enough that we could potentially make xarray an optional dependency: you would only need xarray installed if you want to use its API to manipulate virtual zarr stores (e.g. by concatenating them).

The result could work like this:

# use a virtual reader directly - no xarray needed
ms: ManifestStore = manifeststore_from_hdf('file.nc')

# write to some virtual references format directly - no xarray needed
# this would use `IcechunkStore.set_virtual_refs()` as it currently does
ms.to_icechunk(icechunkstore)

Or, if you want to work in xarray space, you can move to it:

# xarray required to convert to virtual dataset representation
vds: xr.Dataset = ms.to_virtual_dataset(loadable_variables=...)

# (or just go straight there using our existing API)
vds: xr.Dataset = vz.open_virtual_dataset('file.nc', reader=manifeststore_from_hdf, loadable_variables=...)

# xarray required to manipulate the data in xarray space
vds_combined: xr.Dataset = xr.concat([vds1, vds2, ...], dim=...)

# write to some virtual references format - xarray required to write the non-virtual variables
# this could convert the virtual variables to a `ManifestStore` first as well as using `Dataset.to_zarr(icechunkstore)` for the loadable variables as it currently does
vds.to_icechunk(icechunkstore)
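
For the import side of this, a minimal sketch of how the xarray-only methods could defer the import until they are actually called. The names `require_optional` and `ManifestStoreSketch` are hypothetical and only for illustration; they are not existing VirtualiZarr API, and the real `ManifestStore` would hold chunk manifests rather than a dict of shapes.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on demand, or explain what is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            "Install it to use this feature."
        ) from err


class ManifestStoreSketch:
    """Hypothetical stand-in for ManifestStore (assumption, not the real class)."""

    def __init__(self, arrays: dict[str, list[int]]):
        # Array name -> shape, standing in for the real chunk manifests.
        self.arrays = arrays

    def array_names(self) -> list[str]:
        # No xarray needed: this only touches the store's own metadata.
        return sorted(self.arrays)

    def to_virtual_dataset(self):
        # xarray is imported here, at the point of use, not at module import.
        xr = require_optional("xarray", "to_virtual_dataset()")
        return xr.Dataset(attrs={"arrays": self.array_names()})
```

With this shape, importing the package and calling `array_names()` works without xarray, and only `to_virtual_dataset()` fails (with an explicit message) if xarray is missing.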

Advantages:

  1. Total separation of concerns between virtualizing files into the zarr data model and manipulating them using the xarray data model (this would probably help with fill_value and CF-related handling too).
  2. Can create virtual zarr references for data that xarray cannot even represent (e.g. multiple arrays with non-alignable dimensions in the same group).
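
To make the second point concrete, here is a small sketch that flags groups a single xarray Dataset could not hold. It assumes "non-alignable" means the same dimension name appearing with different lengths in one group; the `conflicting_dims` helper and the example group are hypothetical and use plain Python only.

```python
from collections import defaultdict


def conflicting_dims(
    arrays: dict[str, tuple[tuple[str, ...], tuple[int, ...]]],
) -> dict[str, list[int]]:
    """Return dimension names that appear with more than one length."""
    sizes: dict[str, set[int]] = defaultdict(set)
    for dims, shape in arrays.values():
        for dim, size in zip(dims, shape):
            sizes[dim].add(size)
    return {dim: sorted(found) for dim, found in sizes.items() if len(found) > 1}


# Hypothetical group: two arrays share the name "x" but disagree on its length.
group = {
    "temperature": (("time", "x"), (10, 100)),
    "coarse_mask": (("x",), (25,)),
}
print(conflicting_dims(group))
```

A zarr group can store both arrays side by side, whereas the xarray data model expects each dimension name in a Dataset to have one length, so a non-empty result here marks a group that would need the xarray-free path.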

Disadvantages:

  1. Might be less clear for non-expert users, because there would now be two ways to read and write references. I still think we would present the xarray interface as the standard UI and just mention that this is possible in a developers section of the docs, as ManifestStore is only supposed to be developer API anyway.
