sel slice fails with cftime index when using dask.distributed client #5677


Closed
aidanheerdegen opened this issue Aug 6, 2021 · 2 comments

@aidanheerdegen (Contributor)

What happened: Tried to .sel() a time slice from a multi-file dataset while a dask.distributed client was active. Got this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: cftime.datetime(2086, 1, 1, 0, 0, 0, 0, calendar='gregorian', has_year_zero=False)

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   5801         try:
-> 5802             slc = self.get_loc(label)
   5803         except KeyError as err:

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/coding/cftimeindex.py in get_loc(self, key, method, tolerance)
    465         else:
--> 466             return pd.Index.get_loc(self, key, method=method, tolerance=tolerance)
    467 

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 

KeyError: cftime.datetime(2086, 1, 1, 0, 0, 0, 0, calendar='gregorian', has_year_zero=False)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
src/cftime/_cftime.pyx in cftime._cftime.datetime.__richcmp__()

src/cftime/_cftime.pyx in cftime._cftime.datetime.change_calendar()

ValueError: change_calendar only works for real-world calendars

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/local/v45/aph502/tmp/ipykernel_108691/1049912036.py in <module>
----> 1 u.sel(time=slice(start_time,end_time))

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/core/dataarray.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   1313         Dimensions without coordinates: points
   1314         """
-> 1315         ds = self._to_temp_dataset().sel(
   1316             indexers=indexers,
   1317             drop=drop,

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/core/dataset.py in sel(self, indexers, method, tolerance, drop, **indexers_kwargs)
   2472         """
   2473         indexers = either_dict_or_kwargs(indexers, indexers_kwargs, "sel")
-> 2474         pos_indexers, new_indexes = remap_label_indexers(
   2475             self, indexers=indexers, method=method, tolerance=tolerance
   2476         )

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/core/coordinates.py in remap_label_indexers(obj, indexers, method, tolerance, **indexers_kwargs)
    419     }
    420 
--> 421     pos_indexers, new_indexes = indexing.remap_label_indexers(
    422         obj, v_indexers, method=method, tolerance=tolerance
    423     )

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/core/indexing.py in remap_label_indexers(data_obj, indexers, method, tolerance)
    115     for dim, index in indexes.items():
    116         labels = grouped_indexers[dim]
--> 117         idxr, new_idx = index.query(labels, method=method, tolerance=tolerance)
    118         pos_indexers[dim] = idxr
    119         if new_idx is not None:

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/core/indexes.py in query(self, labels, method, tolerance)
    196 
    197         if isinstance(label, slice):
--> 198             indexer = _query_slice(index, label, coord_name, method, tolerance)
    199         elif is_dict_like(label):
    200             raise ValueError(

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/xarray/core/indexes.py in _query_slice(index, label, coord_name, method, tolerance)
     89             "cannot use ``method`` argument if any indexers are slice objects"
     90         )
---> 91     indexer = index.slice_indexer(
     92         _sanitize_slice_element(label.start),
     93         _sanitize_slice_element(label.stop),

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
   5684         slice(1, 3, None)
   5685         """
-> 5686         start_slice, end_slice = self.slice_locs(start, end, step=step)
   5687 
   5688         # return a slice

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
   5886         start_slice = None
   5887         if start is not None:
-> 5888             start_slice = self.get_slice_bound(start, "left")
   5889         if start_slice is None:
   5890             start_slice = 0

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   5803         except KeyError as err:
   5804             try:
-> 5805                 return self._searchsorted_monotonic(label, side)
   5806             except ValueError:
   5807                 # raise the original KeyError

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/indexes/base.py in _searchsorted_monotonic(self, label, side)
   5754     def _searchsorted_monotonic(self, label, side: str_t = "left"):
   5755         if self.is_monotonic_increasing:
-> 5756             return self.searchsorted(label, side=side)
   5757         elif self.is_monotonic_decreasing:
   5758             # np.searchsorted expects ascending sort order, have to reverse

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/base.py in searchsorted(self, value, side, sorter)
   1219     @doc(_shared_docs["searchsorted"], klass="Index")
   1220     def searchsorted(self, value, side="left", sorter=None) -> np.ndarray:
-> 1221         return algorithms.searchsorted(self._values, value, side=side, sorter=sorter)
   1222 
   1223     def drop_duplicates(self, keep="first"):

/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/pandas/core/algorithms.py in searchsorted(arr, value, side, sorter)
   1583         arr = ensure_wrapped_if_datetimelike(arr)
   1584 
-> 1585     return arr.searchsorted(value, side=side, sorter=sorter)
   1586 
   1587 

src/cftime/_cftime.pyx in cftime._cftime.datetime.__richcmp__()

TypeError: cannot compare cftime.datetime(2086, 5, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True) and cftime.datetime(2086, 1, 1, 0, 0, 0, 0, calendar='gregorian', has_year_zero=False)

So the slice indexing has created a bounding value with the wrong calendar: it should be noleap (365_day) but is gregorian.


Note that this only happens when a dask.distributed client is loaded

What you expected to happen: Expected it to return the same slice it returns, without error, when the client is not active.

Minimal Complete Verifiable Example: I tried hard to create a synthetic example but could not make one that fails; loading the mfdataset from disk, however, makes it fail reliably. I have tested this multiple times.

The dataset:

xarray.DataArray
'u'
  • time: 15
  • st_ocean: 75
  • yu_ocean: 2700
  • xu_ocean: 3600
  • Array: 40.74 GiB (3.20 MiB per chunk)
    Shape (15, 75, 2700, 3600), chunk shape (1, 7, 300, 400)
    26735 tasks, 13365 chunks
    dtype float32, dask array of numpy.ndarray
  • Coordinates: 
    • st_ocean
      (st_ocean)
      float64
      0.5413 1.681 ... 5.709e+03
    • time
      (time)
      object
      2085-10-16 12:00:00 ... 2086-12-...
      array([cftime.datetime(2085, 10, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2085, 11, 16, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2085, 12, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 1, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 2, 15, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 3, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 4, 16, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 5, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 6, 16, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 7, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 8, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 9, 16, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 10, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 11, 16, 0, 0, 0, 0, calendar='noleap', has_year_zero=True),
             cftime.datetime(2086, 12, 16, 12, 0, 0, 0, calendar='noleap', has_year_zero=True)],
            dtype=object)
    • xu_ocean
      (xu_ocean)
      float64
      -279.9 -279.8 -279.7 ... 79.9 80.0
    • yu_ocean
      (yu_ocean)
      float64
      -81.09 -81.05 -81.0 ... 89.96 90.0
    • Attributes: 
      long_name :
      i-current
      units :
      m/sec
      valid_range :
      [-10. 10.]
      cell_methods :
      time: mean
      time_avg_info :
      average_T1,average_T2,average_DT
      coordinates :
      geolon_c geolat_c
      standard_name :
      sea_water_x_velocity
      time_bounds :
      <xarray.DataArray 'time_bounds' (time: 15, nv: 2)>
      dask.array<concatenate, shape=(15, 2), dtype=timedelta64[ns], chunksize=(1, 2), chunktype=numpy.ndarray>
      Coordinates:
        * time     (time) object 2085-10-16 12:00:00 ... 2086-12-16 12:00:00
        * nv       (nv) float64 1.0 2.0
      Attributes:
          long_name:  time axis boundaries
          calendar:   NOLEAP
# FWIW
start_time = '2086-01-01'
end_time   = '2086-12-31'
u.sel(time=slice(start_time,end_time))

Anything else we need to know?: I tried following the code execution through with pdb, and it seems to start going wrong here:

def group_indexers_by_index(data_obj, indexers, method=None, tolerance=None):

by line 63 data_obj.xindexes is already in a bad state

xindexes = dict(data_obj.xindexes)

(Pdb) data_obj.xindexes
*** TypeError: cannot compute the time difference between dates with different calendars

It is called here

indexes, grouped_indexers = group_indexers_by_index(
data_obj, indexers, method, tolerance
)

but it isn't obvious to me how that bad state is generated.

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-326.el8.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_AU.utf8
LANG: en_US.UTF-8
LOCALE: ('en_AU', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.7.0
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 2.10.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: 1.3.1
PseudoNetCDF: None
rasterio: 1.2.6
cfgrib: 0.9.9.0
iris: 3.0.4
bottleneck: 1.3.2
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: 0.11.1
numbagg: None
pint: 0.17
setuptools: 52.0.0.post20210125
pip: 21.1.3
conda: 4.10.3
pytest: 6.2.4
IPython: 7.26.0
sphinx: 4.1.2

@spencerkclark (Member)

@aidanheerdegen thanks for the report. Are you sure that you are using cftime version 1.5.0? It is surprising to me that the dates are decoded to cftime.datetime objects and not cftime.DatetimeNoLeap objects. I suspect this is where the problem stems from -- currently xarray does not support the universal base class (#4853).

Eventually the goal is to deprecate the calendar-specific subclasses like cftime.DatetimeNoLeap. For a brief moment -- cftime version 1.4.0 -- the universal class was the default type returned by cftime.num2date, but this proved to be premature, because it broke a significant amount of downstream functionality. More recent versions -- 1.4.1 and later -- have rolled back to returning the subclasses by default. By any chance are you actually using version 1.4.0?

@aidanheerdegen (Contributor, Author)

Thanks for the very prompt response @spencerkclark. The weekend intervened, but I have since narrowed it down further and submitted a new issue:

#5686

Will close this one.
