Description
This may end up being a super obvious mistake to someone who's a bit more familiar with the expectations of the kerchunk API, so forgive me if this issue is being lodged out of ignorance!
The workflow I'm dealing with attempts to combine kerchunk datasets via map/reduce. The reduce step assumes that combination is associative: different numbers of workers for the same job produce different groupings, since each worker's kerchunk refs are combined via MultiZarrToZarr.translate() and then combined again on the driver node to, hopefully, get a single ref for all underlying datasets.
Assumption:
MultiZarrToZarr combination and then translation is associative. Combining refs can happen in any order and then be combined again at the end of the process without different results.
Reality:
I'm seeing radically different results depending on how many workers are used, suggesting that associativity of this combination is not a safe assumption!
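To make the assumption concrete, here is a minimal toy sketch of the property I'm relying on. It does not use kerchunk at all: `split_among_workers`, `map_reduce`, and `is_partition_invariant` are hypothetical helpers written for illustration, and `combine` stands in for whatever pairwise merge is being performed. The idea is simply that the final result should not depend on how many workers the inputs were split across.

```python
from functools import reduce
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_among_workers(items: Sequence[T], n_workers: int) -> List[List[T]]:
    """Split items into contiguous, order-preserving groups, one per worker."""
    size, extra = divmod(len(items), n_workers)
    groups, start = [], 0
    for i in range(n_workers):
        stop = start + size + (1 if i < extra else 0)
        if stop > start:  # skip empty groups when there are more workers than items
            groups.append(list(items[start:stop]))
        start = stop
    return groups


def map_reduce(items: Sequence[T], n_workers: int, combine: Callable[[T, T], T]) -> T:
    """Combine within each worker's group, then combine the partial results."""
    partials = [reduce(combine, group) for group in split_among_workers(items, n_workers)]
    return reduce(combine, partials)


def is_partition_invariant(items, combine, worker_counts=(1, 2, 4)) -> bool:
    """True if every worker count yields the same final result."""
    results = [map_reduce(items, n, combine) for n in worker_counts]
    return all(r == results[0] for r in results[1:])


# Toy stand-ins for per-file refs: list concatenation is associative,
# subtraction is not, so only the first survives regrouping.
print(is_partition_invariant([[1], [2], [3], [4], [5]], lambda a, b: a + b))
print(is_partition_invariant([10, 3, 2, 1], lambda a, b: a - b))
```

With an associative `combine`, the check passes for any worker count; the behaviour I'm reporting looks like the second case.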
This is roughly what that workflow looks like:
Take a list of NetCDF files (all with the same dimensions) and translate each one:
from kerchunk.netCDF3 import NetCDF3ToZarr

# Build kerchunk references for a single NetCDF3 file
chunks = NetCDF3ToZarr(
    url,
    inline_threshold=inline_threshold,
    storage_options=storage_options,
    **(kerchunk_open_kwargs or {}),
)
refs = [chunks.translate()]
These dictionaries are distributed across workers, each of which builds a MultiZarrToZarr instance (map):
# (list[dict]) -> MultiZarrToZarr
MultiZarrToZarr(refs)
Each worker's MultiZarrToZarr is then translated and merged via another MultiZarrToZarr (reduce):
# (Sequence[MultiZarrToZarr]) -> MultiZarrToZarr
refs = [a.translate() for a in multizarrtozarr]
accumulator = MultiZarrToZarr(refs)
Finally, the results are written out as a single ref:
# MultiZarrToZarr -> dict
accumulator.translate()
At this point, the results differ. Below are some statistics I've pulled from the resulting ref files. Note especially the different non-NaN counts:
Stats for analysis_error (single worker):
Mean: 0.3497554361820221
Median: 0.3499999940395355
Standard Deviation: 0.004802505951374769
Non-NaN Count: 137376947
Stats for analysis_error (4 workers):
Mean: 0.3497870862483978
Median: 0.3499999940395355
Standard Deviation: 0.004866031929850578
Non-NaN Count: 109904479
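To narrow down where the two outputs diverge, the reference dictionaries themselves could be compared key by key. The following is a rough sketch, not something kerchunk provides: `diff_refs` is a hypothetical helper, and it assumes each output is either the dict returned by translate() with a top-level "refs" mapping or a bare mapping of chunk keys to references. The keys and file names in the usage example are made up for illustration.

```python
def diff_refs(a: dict, b: dict) -> dict:
    """Report keys that are missing from, or differ between, two reference sets."""
    # Assumption: references live under a top-level "refs" key when present;
    # otherwise the dict itself is treated as the key -> reference mapping.
    refs_a = a.get("refs", a)
    refs_b = b.get("refs", b)
    keys_a, keys_b = set(refs_a), set(refs_b)
    return {
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
        "changed": sorted(k for k in keys_a & keys_b if refs_a[k] != refs_b[k]),
    }


# Made-up example: the second output is missing one chunk reference
# and points another chunk at a different byte offset.
single_worker = {
    "version": 1,
    "refs": {
        "analysis_error/0.0.0": ["a.nc", 100, 50],
        "analysis_error/1.0.0": ["b.nc", 100, 50],
        "analysis_error/2.0.0": ["c.nc", 100, 50],
    },
}
four_workers = {
    "version": 1,
    "refs": {
        "analysis_error/0.0.0": ["a.nc", 100, 50],
        "analysis_error/1.0.0": ["b.nc", 200, 50],
    },
}
report = diff_refs(single_worker, four_workers)
print(report)
```

Chunk keys that appear only in the single-worker output would be consistent with the lower non-NaN count in the 4-worker result, though that is a guess until the actual refs are compared.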