Indexing Variable objects with a mask #1751

shoyer · 2017-12-01T01:12:03Z

This will be useful for multi-dimensional reindexing: marking masked items with
-1 is exactly the convention used by pandas.Index.get_indexer().

Example usage:

In [6]: variable = xr.Variable(('x',), [1, 2, 3])

In [7]: variable._getitem_with_mask([0, 1, 2, -1])
Out[7]:
<xarray.Variable (x: 4)>
array([  1.,   2.,   3.,  nan])

In [8]: variable._getitem_with_mask(xr.Variable(('x', 'y'), [[0, -1], [-1, 1]]), fill_value=-99)
Out[8]:
<xarray.Variable (x: 2, y: 2)>
array([[  1, -99],
       [-99,   2]])

This uses where() so it isn't the most efficient (there is some wasted effort
doing indexing, as noted in the TODOs), but the implementation is pretty clean
and already works with dask.

For now, I'm leaving this as private API, but let's expose it publicly in the
future if we are happy with it. I would probably leave it as a Variable method
since this is pretty low-level.

Tests added / passed
Passes git diff upstream/master **/*py | flake8 --diff

cc @fujiisoup @mraspaud

This will be useful for multi-dimensional reindexing: marking masked items with -1 is exactly the convention used by pandas.Index.get_indexer(). Example usage: In [6]: variable = xr.Variable(('x',), [1, 2, 3]) In [7]: variable._getitem_with_mask([0, 1, 2, -1]) Out[7]: <xarray.Variable (x: 4)> array([ 1., 2., 3., nan]) In [8]: variable._getitem_with_mask(xr.Variable(('x', 'y'), [[0, -1], [-1, 1]]), fill_value=-99) Out[8]: <xarray.Variable (x: 2, y: 2)> array([[ 1, -99], [-99, 2]]) This uses where() so it isn't the most efficient (there is some wasted effort doing indexing, as noted in the TODOs), but the implementation is pretty clean and already works with dask. For now, I'm leaving this as private API, but let's expose it publicly in the future if we are happy with it. I would probably leave it as a Variable method since this is pretty low-level.

fujiisoup · 2017-12-01T01:39:41Z

It is a great idea!
I previously raised #1553 for multidimensional reindex, but am still thinking how could we implement it.
_getitem_with_mask looks cleanest.

I will look inside the code later, but I feel this should be exposed to the public.
In this case, what is the most consistent naming with the existing methods?
isel_mask?

I would probably leave it as a Variable method since this is pretty low-level.

Agreed.

shoyer · 2017-12-01T02:07:21Z

I pushed another commit (mostly but not entirely working) to port reindex() to use getitem_with_mask rather than its current implementation. This allows for some nice simplification for the reindex_variables() code: my git change is 1 file changed, 33 insertions(+), 64 deletions(-).

To get a sense of how this effects performance, I made a small benchmarking script with our tutorial dataset:

import xarray
import numpy as np
ds_numpy = xarray.tutorial.load_dataset('air_temperature').load()
ds_chunked = ds_numpy.chunk({'time': 100})
lat = np.linspace(ds_numpy.lat.min(), ds_numpy.lat.max(), num=100)
lon = np.linspace(ds_numpy.lon.min(), ds_numpy.lon.max(), num=100)
def do_reindex(ds):
    return ds.reindex(lat=lat, lon=lon, method='nearest', tolerance=0.5)
%timeit do_reindex(ds_numpy)
%timeit do_reindex(ds_chunked)
result = do_reindex(ds_chunked)
%timeit result.compute()

Our tutorial dataset is pretty small, but it can still give a flavor of how this scales. I chose new chunks intentionally with a small tolerance to create lots of empty chunks to mask:

In [2]: ds_numpy
Out[2]:
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
  * time     (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
    air      (time, lat, lon) float64 241.2 242.5 243.5 244.0 244.1 243.9 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

In [3]: do_reindex(ds_numpy)
Out[3]:
<xarray.Dataset>
Dimensions:  (lat: 100, lon: 100, time: 2920)
Coordinates:
  * lat      (lat) float64 15.0 15.61 16.21 16.82 17.42 18.03 18.64 19.24 ...
  * lon      (lon) float64 200.0 201.3 202.6 203.9 205.3 206.6 207.9 209.2 ...
  * time     (time) datetime64[ns] 2013-01-01T00:02:06.757437440 ...
Data variables:
    air      (time, lat, lon) float64 296.3 nan 296.8 nan 297.1 nan 297.0 ...
Attributes:
    Conventions:  COARDS
    title:        4x daily NMC reanalysis (1948)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...

Here are the benchmarking results:

Before:

NumPy: 201 ms ± 3.48 ms per loop
Dask build graph: 303 ms ± 5.48 ms per loop
Dask compute: 30.7 s ± 35.9 ms per loop

After:

NumPy: 546 ms ± 26.3 ms per loop
Dask build graph: 6.9 ms ± 464 µs per loop
Dask compute:  411 ms ± 17.9 ms per loop

So NumPy is somewhat slower (about 2.5x), but reindexing with dask is 75x faster! It even shows some ability to parallelize better than pure NumPy.

This is encouraging. We should try to close the performance gap with NumPy (it was cleverly optimized before to use minimal copies of the data), but the existing reindex code with dask when doing masking is so slow that it is almost unusable.

fujiisoup

One comment :)

fujiisoup · 2017-12-01T13:46:36Z

xarray/core/indexing.py

+
+    elif isinstance(indexer, VectorizedIndexer):
+        key = indexer.tuple
+        base_mask = _masked_result_drop_slice(key, chunks_hint)


Can we take care of more advanced indexers, such as 2d-array or multiple arrays to be broadcasted?
We may leave it for future, but I think this PR would be a good place to add the full feature.

I think this already works with advanced indexers, though likely I need more test coverage :)

I misunderstood. Yes, this seems already working.

shoyer · 2017-12-04T02:19:53Z

OK, I'm going to merge this (just the first commit), and leave the second part (actually changing reindex) to another PR.

fujiisoup · 2017-12-04T03:45:48Z

Could you add more test coverage of the first commit?
It would help other developers to follow the logic.

shoyer · 2017-12-04T06:22:57Z

Okay, I'll come up with a few more tests to make sure this maintains 100% coverage... Let me know if you have any ideas for other edge cases.

mraspaud · 2017-12-06T09:28:16Z

That looks fantastic @shoyer , looking forward to testing it :)

shoyer · 2017-12-08T01:47:51Z

I pushed some additional tests, which turned up the fact that dask's vectorized indexing does not support negative indices (fixed by dask/dask#2967).

shoyer · 2017-12-13T16:53:43Z

@fujiisoup could you kindly take another look?

fujiisoup · 2017-12-14T03:00:34Z

Sorry for my late review. I will see later today.

fujiisoup · 2017-12-14T03:23:05Z

This looks good to me.

I think it is ready to expose _getitem_with_mask to public.
As you mention in #TODO, I also think isel is the best place.

shoyer · 2017-12-18T05:39:18Z

I decided to merge in the current state rather than let this get stale. We can add the public API later....

fujiisoup reviewed Dec 1, 2017

View reviewed changes

shoyer force-pushed the getitem-with-mask branch from a784c6c to 2fbdfcf Compare December 4, 2017 02:18

More tests for _getitem_with_mask & fixes for dask

71a1859

shoyer merged commit 4924364 into pydata:master Dec 18, 2017

shoyer mentioned this pull request Jan 22, 2018

Use getitem_with_mask in reindex_variables #1847

Merged

3 tasks

djhoese mentioned this pull request Jan 22, 2018

Optimize xarray pytroll/pyresample#94

Merged

4 tasks

Uh oh!

Indexing Variable objects with a mask #1751

Indexing Variable objects with a mask #1751

Uh oh!

Conversation

shoyer commented Dec 1, 2017

Uh oh!

fujiisoup commented Dec 1, 2017

Uh oh!

shoyer commented Dec 1, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fujiisoup left a comment

Choose a reason for hiding this comment

Uh oh!

fujiisoup Dec 1, 2017

Choose a reason for hiding this comment

Uh oh!

shoyer Dec 1, 2017

Choose a reason for hiding this comment

Uh oh!

fujiisoup Dec 2, 2017

Choose a reason for hiding this comment

Uh oh!

shoyer commented Dec 4, 2017

Uh oh!

fujiisoup commented Dec 4, 2017

Uh oh!

shoyer commented Dec 4, 2017

Uh oh!

mraspaud commented Dec 6, 2017

Uh oh!

shoyer commented Dec 8, 2017

Uh oh!

shoyer commented Dec 13, 2017

Uh oh!

fujiisoup commented Dec 14, 2017

Uh oh!

fujiisoup commented Dec 14, 2017

Uh oh!

shoyer commented Dec 18, 2017

Uh oh!

Uh oh!

shoyer commented Dec 1, 2017 •

edited

Loading