Skip to content

Indexing Variable objects with a mask #1751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 18, 2017
Merged

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Dec 1, 2017

This will be useful for multi-dimensional reindexing: marking masked items with
-1 is exactly the convention used by pandas.Index.get_indexer().

Example usage:

In [6]: variable = xr.Variable(('x',), [1, 2, 3])

In [7]: variable._getitem_with_mask([0, 1, 2, -1])
Out[7]:
<xarray.Variable (x: 4)>
array([  1.,   2.,   3.,  nan])

In [8]: variable._getitem_with_mask(xr.Variable(('x', 'y'), [[0, -1], [-1, 1]]), fill_value=-99)
Out[8]:
<xarray.Variable (x: 2, y: 2)>
array([[  1, -99],
       [-99,   2]])

This uses where() so it isn't the most efficient (there is some wasted effort
doing indexing, as noted in the TODOs), but the implementation is pretty clean
and already works with dask.

For now, I'm leaving this as private API, but let's expose it publicly in the
future if we are happy with it. I would probably leave it as a Variable method
since this is pretty low-level.

  • Tests added / passed
  • Passes git diff upstream/master **/*py | flake8 --diff

cc @fujiisoup @mraspaud

This will be useful for multi-dimensional reindexing: marking masked items with
-1 is exactly the convention used by pandas.Index.get_indexer().

Example usage:

    In [6]: variable = xr.Variable(('x',), [1, 2, 3])

    In [7]: variable._getitem_with_mask([0, 1, 2, -1])
    Out[7]:
    <xarray.Variable (x: 4)>
    array([  1.,   2.,   3.,  nan])

    In [8]: variable._getitem_with_mask(xr.Variable(('x', 'y'), [[0, -1], [-1, 1]]), fill_value=-99)
    Out[8]:
    <xarray.Variable (x: 2, y: 2)>
    array([[  1, -99],
           [-99,   2]])

This uses where() so it isn't the most efficient (there is some wasted effort
doing indexing, as noted in the TODOs), but the implementation is pretty clean
and already works with dask.

For now, I'm leaving this as private API, but let's expose it publicly in the
future if we are happy with it. I would probably leave it as a Variable method
since this is pretty low-level.
@fujiisoup
Copy link
Member

It is a great idea!
I previously raised #1553 for multidimensional reindex, but am still thinking how could we implement it.
_getitem_with_mask looks cleanest.

I will look inside the code later, but I feel this should be exposed to the public.
In this case, what is the most consistent naming with the existing methods?
isel_mask?

I would probably leave it as a Variable method since this is pretty low-level.

Agreed.

@shoyer
Copy link
Member Author

shoyer commented Dec 1, 2017

I pushed another commit (mostly but not entirely working) to port reindex() to use getitem_with_mask rather than its current implementation. This allows for some nice simplification for the reindex_variables() code: my git change is 1 file changed, 33 insertions(+), 64 deletions(-).

To get a sense of how this effects performance, I made a small benchmarking script with our tutorial dataset:

import xarray
import numpy as np
ds_numpy = xarray.tutorial.load_dataset('air_temperature').load()
ds_chunked = ds_numpy.chunk({'time': 100})
lat = np.linspace(ds_numpy.lat.min(), ds_numpy.lat.max(), num=100)
lon = np.linspace(ds_numpy.lon.min(), ds_numpy.lon.max(), num=100)
def do_reindex(ds):
    return ds.reindex(lat=lat, lon=lon, method='nearest', tolerance=0.5)
%timeit do_reindex(ds_numpy)
%timeit do_reindex(ds_chunked)
result = do_reindex(ds_chunked)
%timeit result.compute()

Our tutorial dataset is pretty small, but it can still give a flavor of how this scales. I chose new chunks intentionally with a small tolerance to create lots of empty chunks to mask:

In [2]: ds_numpy
Out[2]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
  * time     (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
    air      (time, lat, lon) float64 241.2 242.5 243.5 244.0 244.1 243.9 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

In [3]: do_reindex(ds_numpy)
Out[3]:
<xarray.Dataset>
Dimensions:  (lat: 100, lon: 100, time: 2920)
Coordinates:
  * lat      (lat) float64 15.0 15.61 16.21 16.82 17.42 18.03 18.64 19.24 ...
  * lon      (lon) float64 200.0 201.3 202.6 203.9 205.3 206.6 207.9 209.2 ...
  * time     (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
    air      (time, lat, lon) float64 296.3 nan 296.8 nan 297.1 nan 297.0 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Here are the benchmarking results:

Before:

NumPy: 201 ms ± 3.48 ms per loop
Dask build graph: 303 ms ± 5.48 ms per loop
Dask compute: 30.7 s ± 35.9 ms per loop

After:

NumPy: 546 ms ± 26.3 ms per loop
Dask build graph: 6.9 ms ± 464 µs per loop
Dask compute:  411 ms ± 17.9 ms per loop

So NumPy is somewhat slower (about 2.5x), but reindexing with dask is 75x faster! It even shows some ability to parallelize better than pure NumPy.

This is encouraging. We should try to close the performance gap with NumPy (it was cleverly optimized before to use minimal copies of the data), but the existing reindex code with dask when doing masking is so slow that it is almost unusable.

Copy link
Member

@fujiisoup fujiisoup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment :)


elif isinstance(indexer, VectorizedIndexer):
key = indexer.tuple
base_mask = _masked_result_drop_slice(key, chunks_hint)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we take care of more advanced indexers, such as 2d-array or multiple arrays to be broadcasted?
We may leave it for future, but I think this PR would be a good place to add the full feature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this already works with advanced indexers, though likely I need more test coverage :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood. Yes, this seems already working.

@shoyer shoyer force-pushed the getitem-with-mask branch from a784c6c to 2fbdfcf Compare December 4, 2017 02:18
@shoyer
Copy link
Member Author

shoyer commented Dec 4, 2017

OK, I'm going to merge this (just the first commit), and leave the second part (actually changing reindex) to another PR.

@fujiisoup
Copy link
Member

Could you add more test coverage of the first commit?
It would help other developers to follow the logic.

@shoyer
Copy link
Member Author

shoyer commented Dec 4, 2017

Okay, I'll come up with a few more tests to make sure this maintains 100% coverage... Let me know if you have any ideas for other edge cases.

@mraspaud
Copy link
Contributor

mraspaud commented Dec 6, 2017

That looks fantastic @shoyer , looking forward to testing it :)

@shoyer
Copy link
Member Author

shoyer commented Dec 8, 2017

I pushed some additional tests, which turned up the fact that dask's vectorized indexing does not support negative indices (fixed by dask/dask#2967).

@shoyer
Copy link
Member Author

shoyer commented Dec 13, 2017

@fujiisoup could you kindly take another look?

@fujiisoup
Copy link
Member

Sorry for my late review. I will see later today.

@fujiisoup
Copy link
Member

This looks good to me.

I think it is ready to expose _getitem_with_mask to public.
As you mention in #TODO, I also think isel is the best place.

@shoyer shoyer merged commit 4924364 into pydata:master Dec 18, 2017
@shoyer
Copy link
Member Author

shoyer commented Dec 18, 2017

I decided to merge in the current state rather than let this get stale. We can add the public API later....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants