Passing multiple kerchunk sideload files to open_mfdataset, not possible with intake #135

@okz

Standard intake plugins seem to support a glob (*) or a list as urlpath, so that multiple files can be consumed with open_mfdataset. This approach isn't suitable for the intake_xarray.xzarr.ZarrSource plugin, since it expects urlpath: "reference://" and uses storage_options: fo to load the sideload file:

    driver: intake_xarray.xzarr.ZarrSource
    args:
      urlpath: "reference://"
      storage_options:
        fo: "sideload.json"

Ideally, fo in the catalog should be able to accept glob paths?

More details:

Having many netCDF files with variable dimensions, we hit the "irregular chunk size between files" issue when trying to use kerchunk.
So instead of combining the netCDF files into a single sideload JSON file, we created a sideload .json for each netCDF file and let xarray take care of the merge. For our datasets this was good enough, and it made working with several months of remote data possible.

Using xarray's open_mfdataset directly, it was possible to use multiple JSONs, e.g.:

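    import fsspec
    import ujson
    import xarray as xr

    # Defined elsewhere: urls (paths of the sideload JSONs), fs (fsspec
    # filesystem holding them), so (options for reading the netCDF files).
    m_list = []
    for js in urls:
        with fs.open(js) as f:
            # One reference-filesystem mapper per sideload file.
            m_list.append(fsspec.get_mapper("reference://",
                          fo=ujson.load(f), remote_protocol="file",
                          remote_options=so))

    # Let xarray concatenate the per-file datasets along "time".
    ds = xr.open_mfdataset(m_list, engine='zarr',
                            combine="nested",
                            backend_kwargs={
                                "consolidated": False,
                            },
                            concat_dim="time")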

It would have been nice to get rid of this code and use an intake catalog instead.
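For reference, here is a rough sketch of what glob support for fo could boil down to. It is only an illustration of the idea, not how intake or intake-xarray work today. The names reference_mapper_kwargs and open_sideloads are made up for this example. It assumes the sideload JSONs are local files whose sorted names follow time order, since combine="nested" concatenates in list order. It also passes each JSON path as fo rather than the loaded dict, which the fsspec reference filesystem accepts as well.

```python
import glob


def reference_mapper_kwargs(pattern, remote_protocol="file", remote_options=None):
    """Expand a glob of sideload JSONs into one set of get_mapper kwargs per file.

    Hypothetical helper: each dict is meant to be passed as
    fsspec.get_mapper("reference://", **kwargs).
    """
    # Sorted so the concatenation order is deterministic; this assumes the
    # file names sort in the same order as the data along the concat dimension.
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no sideload files match {pattern!r}")
    return [
        {
            "fo": path,
            "remote_protocol": remote_protocol,
            "remote_options": dict(remote_options or {}),
        }
        for path in paths
    ]


def open_sideloads(pattern, concat_dim="time", **mapper_options):
    """Open every sideload JSON matching `pattern` as a single dataset."""
    # Imported here so the glob expansion above works without these packages.
    import fsspec
    import xarray as xr

    mappers = [
        fsspec.get_mapper("reference://", **kwargs)
        for kwargs in reference_mapper_kwargs(pattern, **mapper_options)
    ]
    return xr.open_mfdataset(
        mappers,
        engine="zarr",
        combine="nested",
        concat_dim=concat_dim,
        backend_kwargs={"consolidated": False},
    )
```

With something like this, the loop above would reduce to ds = open_sideloads("sideloads/*.json", remote_options=so), where the directory name is again just a placeholder.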
