
xindexes set incorrectly for mfdataset with dask client and parallel=True #5686


Closed
aidanheerdegen opened this issue Aug 9, 2021 · 8 comments


@aidanheerdegen
Contributor

What happened: Using open_mfdataset with parallel=True while a dask.distributed client is active fails to set .xindexes correctly.

What you expected to happen: .xindexes should contain a time index that can be printed correctly. Instead, taking the repr of .xindexes fails with TypeError: cannot compute the time difference between dates with different calendars, due to an error in .asi8.

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np
from dask.distributed import Client

# Need a main routine for dask.distributed if run as script
if __name__ == "__main__":

    client = Client(n_workers=1) 

    # Create some synthetic data
    time_365_decade = xr.cftime_range(start="2100", periods=120, freq="1MS", calendar="noleap")

    ds = xr.Dataset(
            {"a": ("time", np.arange(time_365_decade.size))},
            coords={"time": time_365_decade},
    )   

    index_microseconds = ds.xindexes['time'].array.asi8

    # Save to a file per year
    years, datasets = zip(*ds.groupby("time.year"))
    xr.save_mfdataset(datasets, [f"{y}.nc" for y in years])

    # Open saved files, parallel=False and asi8 ok
    assert (index_microseconds == xr.open_mfdataset('2???.nc', parallel=False).xindexes['time'].array.asi8).all()

    # Open saved files, parallel=True and asi8 fails
    assert (index_microseconds == xr.open_mfdataset('2???.nc', parallel=True).xindexes['time'].array.asi8).all()

Anything else we need to know?: The asi8 property fails here:

https://github.com/pydata/xarray/blob/main/xarray/coding/cftimeindex.py#L677

because

epoch = self.date_type(1970, 1, 1)

returns a cftime.datetime whose calendar and has_year_zero attributes do not match the index:

(Pdb) p epoch
cftime.datetime(1970, 1, 1, 0, 0, 0, 0, calendar='gregorian', has_year_zero=False)
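
For illustration, here is a minimal sketch (outside xarray, using only cftime) of that mismatch; it reproduces the same TypeError when the two dates are subtracted:

import cftime

# Epoch built as a plain cftime.datetime, mirroring the pdb output above
epoch = cftime.datetime(1970, 1, 1, calendar="gregorian")

# A date from the noleap index
date = cftime.DatetimeNoLeap(2100, 1, 1)

# Subtracting dates with different calendars raises the TypeError reported above
try:
    date - epoch
except TypeError as err:
    print(err)  # cannot compute the time difference between dates with different calendars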

I previously reported this as #5677.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-305.7.1.el8.nci.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_AU.utf8
LANG: en_AU.ISO8859-1
LOCALE: ('en_AU', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.1
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 2.10.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: 1.3.1
PseudoNetCDF: None
rasterio: 1.2.6
cfgrib: 0.9.9.0
iris: 3.0.4
bottleneck: 1.3.2
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: 0.11.1
numbagg: None
pint: 0.17
setuptools: 52.0.0.post20210125
pip: 21.1.3
conda: 4.10.3
pytest: 6.2.4
IPython: 7.26.0
sphinx: 4.1.2

@shoyer
Member

shoyer commented Aug 9, 2021

Thanks for the updated report! Could you kindly share the full error traceback?

@spencerkclark
Member

@aidanheerdegen thanks a lot for the minimal example -- I'm able to reproduce it now -- this is indeed a confusing bug! I think this is a symptom of the same problem I identified in #5677: the dates in your dataset are being decoded to cftime.datetime objects instead of cftime.DatetimeNoLeap objects. This causes problems downstream in a variety of places in xarray, because xarray currently infers the date and calendar type of the index by checking the type of the dates it contains.
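
(As a quick sketch of why that matters: a base-class cftime.datetime that merely carries calendar='noleap' as an attribute is not an instance of the calendar-specific subclass, so type-based inference cannot recover the calendar from it.)

>>> import cftime
>>> generic = cftime.datetime(2000, 1, 1, calendar="noleap")
>>> isinstance(generic, cftime.DatetimeNoLeap)
False
>>> generic.calendar
'noleap'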

The question is: why is this happening? I initially thought this could only happen if you were using cftime version 1.4.0, but that is clearly not true. It is interesting that this only comes up when using a distributed client and parallel=True in open_mfdataset, while with other more basic approaches the dates are decoded properly to cftime.DatetimeNoLeap objects. I think a more minimal example of this issue may be the following:

>>> import cftime; import dask; import distributed
>>> cftime.__version__
'1.5.0'
>>> dask.__version__
'2021.07.2'
>>> distributed.__version__
'2021.07.2'
>>> client = distributed.Client()
>>> cftime.num2date([0, 1, 2], units="days since 2000-01-01", calendar="noleap")
array([cftime.DatetimeNoLeap(2000, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2000, 1, 2, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2000, 1, 3, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)
>>> delayed_num2date = dask.delayed(cftime.num2date)
>>> delayed_num2date([0, 1, 2], units="days since 2000-01-01", calendar="noleap").compute()
array([cftime.datetime(2000, 1, 1, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
       cftime.datetime(2000, 1, 2, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
       cftime.datetime(2000, 1, 3, 0, 0, 0, 0, calendar='noleap', has_year_zero=True)],
      dtype=object)

Note that when using delayed in conjunction with a distributed.Client the dates are decoded to cftime.datetime objects instead of cftime.DatetimeNoLeap objects. This fairly clearly demonstrates that this is an upstream issue -- likely in cftime -- but I need to dig a little more to determine exactly what the issue is.
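
For contrast, here is a sketch reusing delayed_num2date from above but forcing dask's local threaded scheduler, so the active Client (and its serialization round-trip) is bypassed; with no pickling involved, the subclass should survive, matching the direct call:

>>> # sketch: bypass the distributed Client by forcing the local threaded scheduler (no pickling)
>>> delayed_num2date([0, 1, 2], units="days since 2000-01-01", calendar="noleap").compute(scheduler="threads")
array([cftime.DatetimeNoLeap(2000, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2000, 1, 2, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2000, 1, 3, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)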

@aidanheerdegen
Contributor Author

> Thanks for the updated report! Could you kindly share the full error traceback?

Sorry, see below

Traceback (most recent call last):
  File "/g/data/v45/aph502/helpdesk/fromgithub/20210804-Navid/mcve.py", line 28, in <module>
    assert (index_microseconds == xr.open_mfdataset('2???.nc', parallel=True).xindexes['time'].array.asi8).all()
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/coding/cftimeindex.py", line 683, in asi8
    [
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/coding/cftimeindex.py", line 684, in <listcomp>
    _total_microseconds(exact_cftime_datetime_difference(epoch, date))
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/core/resample_cftime.py", line 358, in exact_cftime_datetime_difference
    seconds = b.replace(microsecond=0) - a.replace(microsecond=0)
  File "src/cftime/_cftime.pyx", line 1369, in cftime._cftime.datetime.__sub__
TypeError: cannot compute the time difference between dates with different calendars

@aidanheerdegen
Contributor Author

Thanks again for the prompt response @spencerkclark. Yes, your MCVE is more (less?) M than mine. Thanks.

Perhaps I shouldn't have started a new issue, but it seemed the specific problem with .sel was just a knock-on effect from this cftime issue.

I should have said in #5677 that as far as I could tell I was using cftime=1.5.0.

@aidanheerdegen
Contributor Author

A colleague suggested it might be some sort of pickling issue, passing the generated object back to the main thread, but it was just speculation and I had no idea how to test that.

@spencerkclark
Member

> A colleague suggested it might be some sort of pickling issue, passing the generated object back to the main thread, but it was just speculation and I had no idea how to test that.

Yes, it must be something of that sort. Here's perhaps an even more minimal example:

>>> import cftime; import distributed
>>> header, frames = distributed.protocol.serialize(cftime.DatetimeNoLeap(2000, 1, 1))
>>> distributed.protocol.deserialize(header, frames)
cftime.datetime(2000, 1, 1, 0, 0, 0, 0, calendar='noleap', has_year_zero=True)

Further, removing distributed from the mix, we can show this just using pickle:

>>> import pickle
>>> serialized = pickle.dumps(cftime.DatetimeNoLeap(2000, 1, 1))
>>> deserialized = pickle.loads(serialized)
>>> deserialized
cftime.datetime(2000, 1, 1, 0, 0, 0, 0, calendar='noleap', has_year_zero=True)

I'll make an issue in cftime.
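
(A sketch of the round-trip check to rerun once a fixed cftime is released: with cftime 1.5.0 it currently returns False, consistent with the output above, and it should return True once unpickling preserves the calendar-specific subclass.)

>>> import pickle, cftime
>>> roundtripped = pickle.loads(pickle.dumps(cftime.DatetimeNoLeap(2000, 1, 1)))
>>> isinstance(roundtripped, cftime.DatetimeNoLeap)  # False on cftime 1.5.0; expected True after the fix
False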

@spencerkclark
Member

@aidanheerdegen I tested your example with my upstream changes (Unidata/cftime#252) and it now works, so this should be fixed with the next release of cftime. I'll go ahead and close this issue. Thanks for your help debugging this!

@aidanheerdegen
Contributor Author

Thanks for the super-fast fix. I have confirmed this fixes #5677.
