Description
Standard intake plugins seem to support a glob (*) or a list of urlpaths, to consume multiple files with open_mfdataset. This approach isn't suitable for the intake_xarray.xzarr.ZarrSource plugin, since it expects urlpath: "reference://" and uses storage_options: fo to load the sideload file:
driver: intake_xarray.xzarr.ZarrSource
args:
  urlpath: "reference://"
  storage_options:
    fo: "sideload.json"
Ideally, could fo in the catalog accept glob paths as well?
More details:
We have many netCDF files with variable dimensions, and we hit the "irregular chunk size between files" issue when trying to use kerchunk.
So instead of combining the netCDF files into a single sideload JSON file, we created one sideload .json per netCDF file and let xarray take care of the merge. For our datasets this was good enough, and it made working with several months of remote data possible.
Using xarray's open_mfdataset directly, it was possible to use multiple JSONs, e.g.:
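import fsspec
import ujson
import xarray as xr

# Assumed to be defined earlier: `urls` (paths of the per-file sideload JSONs),
# `fs` (an fsspec filesystem that can read them) and `so` (remote storage options).
m_list = []
for js in urls:
    with fs.open(js) as f:
        # One reference mapper per sideload JSON.
        m_list.append(fsspec.get_mapper("reference://",
                                        fo=ujson.load(f), remote_protocol="file",
                                        remote_options=so))

# Let xarray concatenate the individual stores along time.
ds = xr.open_mfdataset(m_list, engine='zarr',
                       combine="nested",
                       backend_kwargs={
                           "consolidated": False,
                       },
                       concat_dim="time")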
It would be nice to get rid of this code and use an intake catalog instead.
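To make the request a bit more concrete, here is a rough sketch of the expansion step that a glob-aware fo could perform before the mappers are built. The helper name expand_fo is hypothetical and is not part of intake or intake-xarray. It uses the standard library glob module, so it only covers local paths. For remote sideload files, the glob method of an fsspec filesystem would be the natural analogue.

```python
import glob
from typing import List, Sequence, Union


def expand_fo(fo: Union[str, Sequence[str]]) -> List[str]:
    """Hypothetical helper: expand a glob pattern or a list of paths.

    Patterns containing glob characters are expanded and sorted;
    plain paths are passed through unchanged, in the order given.
    """
    patterns = [fo] if isinstance(fo, str) else list(fo)
    paths: List[str] = []
    for pattern in patterns:
        if any(ch in pattern for ch in "*?["):
            # Sort so that the concatenation order is deterministic.
            paths.extend(sorted(glob.glob(pattern)))
        else:
            paths.append(pattern)
    return paths
```

The resulting list could then take the place of urls in the loop above, so a catalog entry would only need something like fo: "sideloads/*.json" (an assumed layout, not an existing option).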