align() outer join returns DataArrays that are all NaNs #2215

Closed
jjpr-mit opened this issue Jun 5, 2018 · 10 comments

Comments

@jjpr-mit

jjpr-mit commented Jun 5, 2018

Code Sample, a copy-pastable example if possible

The problem occurs for me in the midst of a data-processing pipeline that starts with some ~40MB netCDF files. I've tried to create pasteable code that reproduces the behavior from scratch, but I haven't succeeded.

Problem description

I pass two DataArrays to xr.align() with join="outer". The DataArrays are dtype float64, and contain a mix of NaNs and floats. They are 2D and have MultiIndexes with some numeric and some string levels.

The DataArrays in the tuple returned by align() have the correct shape and the expected indexes, but their contents are all NaN. The original float values are gone: np.nonzero(~np.isnan(da)) returns an empty array.

I've set breakpoints and delved into the code. On line 656 in xarray.core.variable.Variable._getitem_with_mask, self contains non-NaN values, but the data returned by as_indexable(self._data)[actual_indexer] evaluates as all NaNs. However, data.array at that point (which is xarray.backends.netCDF4_.NetCDF4ArrayWrapper) has non-NaNs. So it's some sort of masking caused by the indexing that makes it look like data is all NaNs.
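To illustrate the kind of masking described above, here is a minimal sketch of label-based reindexing with a mask. It is not xarray's actual implementation; the helper name `reindex_with_nan` and the sample labels are made up for illustration. It relies only on `pandas.Index.get_indexer`, which returns -1 for labels it cannot find. If every lookup fails, every position is masked and the result is all NaN even though the source data is intact.

```python
import numpy as np
import pandas as pd


def reindex_with_nan(data, source_index, target_index):
    """Move `data` from source_index onto target_index, NaN where a label is absent.

    Hypothetical helper for illustration only; not part of xarray or pandas.
    """
    # Position of each target label in the source, or -1 if it is not found.
    indexer = source_index.get_indexer(target_index)
    missing = indexer == -1
    # Use a valid position for the lookup, then overwrite the missing slots.
    safe = np.where(missing, 0, indexer)
    return np.where(missing, np.nan, data[safe])


data = np.array([10.0, 11.0, 12.0])
target = pd.Index([0, 1, 2, 3])

# Three of four labels are found: only the last slot is NaN.
partial = reindex_with_nan(data, pd.Index([0, 1, 2]), target)

# No label is found: the result is all NaN, although `data` itself is not.
nothing = reindex_with_nan(data, pd.Index([7, 8, 9]), target)
```

Whether the failure in this issue comes from label lookups failing in this way is an assumption; the sketch only shows that a mask applied during indexing can hide values that are still present in the underlying array.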

Expected Output

A tuple of DataArrays which contain some non-NaN values.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-116-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.4
pandas: 0.22.0
numpy: 1.14.0
scipy: 1.0.0
netCDF4: 1.3.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.1.2
cartopy: None
seaborn: None
setuptools: 38.4.0
pip: 9.0.1
conda: None
pytest: 3.3.2
IPython: 6.2.1
sphinx: None

@shoyer
Member

shoyer commented Jun 5, 2018

Are you sure the indexes along the aligned dimensions match exactly? Small differences in floats are the most common source of this issue.

Try using second.reindex_like(first, method='nearest') instead of xarray.align(first, second).
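A small sketch of the float-mismatch case mentioned here, using made-up arrays `first` and `second` whose x labels differ only by rounding error. It is an illustration of the suggestion, not a diagnosis of the reporter's data.

```python
import xarray as xr

first = xr.DataArray([1.0, 2.0, 3.0], dims="x", coords={"x": [0.1, 0.2, 0.3]})
# 0.1 + 0.2 is 0.30000000000000004 in floating point, so the last label differs.
second = xr.DataArray(
    [10.0, 20.0, 30.0], dims="x", coords={"x": [0.1, 0.2, 0.1 + 0.2]}
)

# The outer join treats the two nearly equal labels as distinct, so each
# aligned array gains one extra position filled with NaN.
a, b = xr.align(first, second, join="outer")

# Snapping `second` onto the labels of `first` avoids the spurious NaNs.
second_on_first = second.reindex_like(first, method="nearest")
```

With exact matching, `a` and `b` each have four positions along x and one NaN; with `method="nearest"`, `second_on_first` keeps three positions and all three values.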

@jjpr-mit
Author

jjpr-mit commented Jun 5, 2018

I found a way to reproduce the error. One of the MultiIndex levels on the DataArrays has NaNs in it. If I remove that level, the correct values appear in the result. Should the presence of that MultiIndex level cause this behavior?

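import string
import numpy as np
import xarray as xr

dims = ("x", "y")
shape = (10, 5)
das = []
for j in (0, 1):
    # Start from an all-NaN array and place one float per row.
    data = np.full(shape, np.nan, dtype="float64")
    for i in range(shape[0]):
        data[i, i % shape[1]] = float(i)
    coords_d = {
        # Disjoint integer labels: 0-9 for the first array, 10-19 for the second.
        "ints": ("x", range(j*shape[0], (j+1)*shape[0])),
        # The MultiIndex level whose removal makes the problem go away: all NaN.
        "nans": ("x", np.array([np.nan] * shape[0], dtype="float64")),
        "lower": ("y", list(string.ascii_lowercase[:shape[1]]))
    }
    da = xr.DataArray(data=data, dims=dims, coords=coords_d)
    # inplace=True is accepted by xarray 0.10.4, the version reported above.
    da.set_index(append=True, inplace=True, x=["ints", "nans"], y=["lower"])
    das.append(da)

# Positions of the non-NaN values before alignment.
nonzeros_raw = [np.nonzero(~np.isnan(da)) for da in das]
print("nonzeros_raw: ")
print(nonzeros_raw)

# Positions of the non-NaN values after an outer-join alignment.
aligned = xr.align(*das, join="outer")
nonzeros_aligned = [np.nonzero(~np.isnan(da)) for da in aligned]
print("nonzeros_aligned: ")
print(nonzeros_aligned)

# Fails when the aligned arrays have lost their non-NaN values.
assert nonzeros_raw[0].shape == nonzeros_aligned[0].shape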

@shoyer
Member

shoyer commented Jun 5, 2018

Thanks for the example. Can you please identify exactly which behavior you find surprising, and what you think the result should be?

@jjpr-mit
Author

jjpr-mit commented Jun 5, 2018

Since the align is an outer join, I would expect all the non-NaN values in the original DataArrays to also appear in the aligned DataArrays. Perhaps I am misinterpreting the behavior of join="outer".
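The expectation stated here can be written as a check: an outer join should leave the number of non-NaN values in each array unchanged. Below is a minimal sketch with plain (non-MultiIndex) coordinates and made-up data; the helper name `count_valid` is hypothetical. It shows the expected behavior and does not reproduce the failure reported in this issue.

```python
import numpy as np
import xarray as xr


def count_valid(da):
    """Number of non-NaN elements in a DataArray (hypothetical helper)."""
    return int(da.count())


first = xr.DataArray(
    [[0.0, np.nan], [np.nan, 1.0]],
    dims=("x", "y"),
    coords={"x": [0, 1], "y": ["a", "b"]},
)
second = xr.DataArray(
    [[2.0, np.nan], [np.nan, 3.0]],
    dims=("x", "y"),
    coords={"x": [2, 3], "y": ["a", "b"]},
)

# The x labels are disjoint, so the outer join pads each array with NaN rows.
aligned = xr.align(first, second, join="outer")

before = [count_valid(da) for da in (first, second)]
after = [count_valid(da) for da in aligned]
```

If `before` and `after` differ, values were lost during alignment rather than merely padded with NaN.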

@jjpr-mit
Author

jjpr-mit commented Jun 5, 2018

For clarity, here are the prints of the arrays before and after alignment:

Before alignment:

[<xarray.DataArray (x: 10, y: 5)>
 array([[ 0., nan, nan, nan, nan],
        [nan,  1., nan, nan, nan],
        [nan, nan,  2., nan, nan],
        [nan, nan, nan,  3., nan],
        [nan, nan, nan, nan,  4.],
        [ 5., nan, nan, nan, nan],
        [nan,  6., nan, nan, nan],
        [nan, nan,  7., nan, nan],
        [nan, nan, nan,  8., nan],
        [nan, nan, nan, nan,  9.]])
 Coordinates:
   * x        (x) MultiIndex
   - ints     (x) int64 0 1 2 3 4 5 6 7 8 9
   - nans     (x) float64 nan nan nan nan nan nan nan nan nan nan
   * y        (y) object 'a' 'b' 'c' 'd' 'e', <xarray.DataArray (x: 10, y: 5)>
 array([[ 0., nan, nan, nan, nan],
        [nan,  1., nan, nan, nan],
        [nan, nan,  2., nan, nan],
        [nan, nan, nan,  3., nan],
        [nan, nan, nan, nan,  4.],
        [ 5., nan, nan, nan, nan],
        [nan,  6., nan, nan, nan],
        [nan, nan,  7., nan, nan],
        [nan, nan, nan,  8., nan],
        [nan, nan, nan, nan,  9.]])
 Coordinates:
   * x        (x) MultiIndex
   - ints     (x) int64 10 11 12 13 14 15 16 17 18 19
   - nans     (x) float64 nan nan nan nan nan nan nan nan nan nan
   * y        (y) object 'a' 'b' 'c' 'd' 'e']

After alignment:

(<xarray.DataArray (x: 20, y: 5)>
 array([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])
 Coordinates:
   * x        (x) MultiIndex
   - ints     (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
   - nans     (x) object nan nan nan nan nan nan nan nan nan nan nan nan nan ...
   * y        (y) object 'a' 'b' 'c' 'd' 'e', <xarray.DataArray (x: 20, y: 5)>
 array([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])
 Coordinates:
   * x        (x) MultiIndex
   - ints     (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
   - nans     (x) object nan nan nan nan nan nan nan nan nan nan nan nan nan ...
   * y        (y) object 'a' 'b' 'c' 'd' 'e')

@shoyer
Member

shoyer commented Jun 5, 2018

> Since the align is an outer join, I would expect all the non-NaN values in the original DataArrays to also appear in the aligned DataArrays.

Sorry, I'm not quite following -- can you please give a specific example of which output from your example looks wrong, and print how it should look instead?

@jjpr-mit
Author

jjpr-mit commented Jun 5, 2018

This is what I would expect to see returned by align():

(<xarray.DataArray (x: 20, y: 5)>
 array([[ 0., nan, nan, nan, nan],
        [nan,  1., nan, nan, nan],
        [nan, nan,  2., nan, nan],
        [nan, nan, nan,  3., nan],
        [nan, nan, nan, nan,  4.],
        [ 5., nan, nan, nan, nan],
        [nan,  6., nan, nan, nan],
        [nan, nan,  7., nan, nan],
        [nan, nan, nan,  8., nan],
        [nan, nan, nan, nan,  9.],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])
 Coordinates:
 * x        (x) MultiIndex
 - ints     (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
 - nans     (x) object nan nan nan nan nan nan nan nan nan nan nan nan nan ...
 * y        (y) object 'a' 'b' 'c' 'd' 'e', <xarray.DataArray (x: 20, y: 5)>
 array([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [ 0., nan, nan, nan, nan],
        [nan,  1., nan, nan, nan],
        [nan, nan,  2., nan, nan],
        [nan, nan, nan,  3., nan],
        [nan, nan, nan, nan,  4.],
        [ 5., nan, nan, nan, nan],
        [nan,  6., nan, nan, nan],
        [nan, nan,  7., nan, nan],
        [nan, nan, nan,  8., nan],
        [nan, nan, nan, nan,  9.]])
 Coordinates:
 * x        (x) MultiIndex
 - ints     (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
 - nans     (x) object nan nan nan nan nan nan nan nan nan nan nan nan nan ...
 * y        (y) object 'a' 'b' 'c' 'd' 'e')

I see something very similar, but with the nans level removed, if I do this:
xr.align(*[da.reset_index("nans", drop=True) for da in das], join="outer")

@shoyer
Member

shoyer commented Jun 6, 2018

This is what I see when printing aligned from your example:

In [26]: aligned
Out[26]:
(<xarray.DataArray (x: 20, y: 5)>
 array([[ 0., nan, nan, nan, nan],
        [nan,  1., nan, nan, nan],
        [nan, nan,  2., nan, nan],
        [nan, nan, nan,  3., nan],
        [nan, nan, nan, nan,  4.],
        [ 5., nan, nan, nan, nan],
        [nan,  6., nan, nan, nan],
        [nan, nan,  7., nan, nan],
        [nan, nan, nan,  8., nan],
        [nan, nan, nan, nan,  9.],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan]])
 Coordinates:
   * x        (x) MultiIndex
   - ints     (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
   - nans     (x) float64 nan nan nan nan nan nan nan nan nan nan nan nan nan ...
   * y        (y) object 'a' 'b' 'c' 'd' 'e', <xarray.DataArray (x: 20, y: 5)>
 array([[nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan],
        [ 0., nan, nan, nan, nan],
        [nan,  1., nan, nan, nan],
        [nan, nan,  2., nan, nan],
        [nan, nan, nan,  3., nan],
        [nan, nan, nan, nan,  4.],
        [ 5., nan, nan, nan, nan],
        [nan,  6., nan, nan, nan],
        [nan, nan,  7., nan, nan],
        [nan, nan, nan,  8., nan],
        [nan, nan, nan, nan,  9.]])
 Coordinates:
   * x        (x) MultiIndex
   - ints     (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
   - nans     (x) float64 nan nan nan nan nan nan nan nan nan nan nan nan nan ...
   * y        (y) object 'a' 'b' 'c' 'd' 'e')

The only material difference I can see in our environments is that I'm running pandas 0.23 and you're running pandas 0.22. Can you try updating pandas and see if that fixes the issue?

@jjpr-mit
Author

jjpr-mit commented Jun 13, 2018

@shoyer That did it. Under pandas 0.22, the DataArrays in aligned are all NaNs. I updated to pandas 0.23, and the non-NaN values were there as expected. To double-check, I downgraded to 0.22 again and got all NaNs again.

@shoyer
Member

shoyer commented Jun 13, 2018

OK, great. I'm going to close this then, and simply recommend that anyone who encounters this issue try upgrading pandas.
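For anyone who wants to guard against this in their own pipeline, here is a minimal sketch of a version check. The 0.23 threshold comes from this thread (0.22 showed the problem, 0.23 did not); the helper name `pandas_at_least` is hypothetical, and the parsing assumes the version string begins with numeric major and minor components.

```python
import pandas as pd


def pandas_at_least(major, minor, version=pd.__version__):
    """Return True if `version` is at least major.minor (hypothetical helper)."""
    parts = version.split(".")
    # Compare only the leading major and minor components as integers.
    return (int(parts[0]), int(parts[1])) >= (major, minor)


if not pandas_at_least(0, 23):
    print(
        "pandas %s is older than 0.23; xr.align(..., join='outer') on a "
        "MultiIndex with a NaN level may return all-NaN arrays." % pd.__version__
    )
```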

@shoyer shoyer closed this as completed Jun 13, 2018
jjpr-mit added a commit to brain-score/brainio_contrib that referenced this issue Oct 31, 2018