Taking the mean of a long-spanning np.datetime64 array produces the wrong value #10019
Comments
This is a bit of a tricky problem. Pandas handles it essentially by converting the datetime values to floats, which avoids the overflow but, due to precision loss, fails on some other examples. For instance, taking the mean of the minimum representable datetime value, or taking the mean of a time distant from the Unix epoch, can change its value.
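A minimal sketch of that precision-loss failure mode (using an assumed timestamp, not one of the examples from this comment): float64 has a 53-bit mantissa, so nanosecond values far from the epoch cannot round-trip through it.

```python
import numpy as np

# Near 7.26e18 the spacing between adjacent float64 values is 1024, so the
# trailing nanosecond is silently dropped when routed through floating point.
t = np.datetime64("2200-01-01T00:00:00.000000001", "ns")
i = t.astype("int64")   # 7258118400000000001 ns since the Unix epoch
f = np.float64(i)       # rounds to the nearest multiple of 1024 ns
print(np.int64(f) - i)  # -1: the value changed
```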
Somewhat inspired by numpy/numpy#12901 (comment), a way around the precision loss issue would be to convert to np.longdouble, replacing the relevant logic in xarray/xarray/core/duck_array_ops.py (lines 710 to 722 at c252152) with:

```python
array_as_longdouble = where(isnat(array), np.nan, array.astype(np.longdouble))
return _mean(array_as_longdouble, axis=axis, skipna=skipna, **kwargs).astype(
    array.dtype
)
```

This produces nice results on some systems, though np.longdouble offers no extra precision over float64 on others.
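Note that whether the longdouble approach gains anything is platform-dependent: np.longdouble is 80-bit extended precision on many x86-64 builds, but is merely an alias for float64 on Windows and some ARM platforms. A quick way to check (a sketch, not part of the patch above):

```python
import numpy as np

# nmant is the mantissa width in bits: 52 means np.longdouble is effectively
# float64 (no extra precision); 63 indicates 80-bit extended precision.
print(np.finfo(np.longdouble).nmant)
```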
We could calculate the mean year upfront and use that as the offset value instead of the minimum value. I've drafted up a PR for further discussion: #10035
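A rough sketch of that idea (hypothetical code; the actual implementation lives in #10035): choosing an offset near the middle of the data keeps every delta within the int64 timedelta range.

```python
import numpy as np

times = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")

# Compute an approximate central offset at year resolution, where overflow
# is impossible, then take the per-element deltas relative to it.
mean_year = int(times.astype("datetime64[Y]").astype("int64").mean())
offset = np.datetime64(mean_year, "Y").astype(times.dtype)
deltas = times - offset        # each delta is ~291 years: well within int64
print(offset + deltas.mean())  # 1969-01-01, the true midpoint
```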
What happened?
As noted in #9977 (comment), taking the mean of a np.datetime64 array with values that span more than half of the resolution-dependent range produces an incorrect result. This is due to overflow when computing the timedeltas relative to the minimum datetime value of the array (code); see the sketch below.
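The overflow can be seen with plain NumPy (illustrative values chosen to span more than half of the datetime64[ns] range):

```python
import numpy as np

# The values are ~582 years (~1.8e19 ns) apart, more than int64 can hold,
# so subtracting the minimum wraps around to a negative timedelta.
times = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
print(times - times.min())  # the second delta is wrapped and negative
```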
What did you expect to happen?
This is the outcome we would expect to see in this example: the true midpoint of the datetime values, computed without overflow.
Minimal Complete Verifiable Example
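The example from the original report is not preserved in this copy; the following sketch (with assumed values) reproduces the described behavior:

```python
import numpy as np
import xarray as xr

# Two datetimes spanning most of the datetime64[ns] range; their true mean
# is 1969-01-01, but the min-relative timedeltas overflow internally.
da = xr.DataArray(
    np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]"), dims="time"
)
print(da.mean())
```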
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment