Indexing Variable objects with a mask #1751
This will be useful for multi-dimensional reindexing: marking masked items with -1 is exactly the convention used by pandas.Index.get_indexer().

Example usage:

    In [6]: variable = xr.Variable(('x',), [1, 2, 3])

    In [7]: variable._getitem_with_mask([0, 1, 2, -1])
    Out[7]:
    <xarray.Variable (x: 4)>
    array([ 1.,  2.,  3., nan])

    In [8]: variable._getitem_with_mask(xr.Variable(('x', 'y'), [[0, -1], [-1, 1]]), fill_value=-99)
    Out[8]:
    <xarray.Variable (x: 2, y: 2)>
    array([[  1, -99],
           [-99,   2]])

This uses where() so it isn't the most efficient (there is some wasted effort doing indexing, as noted in the TODOs), but the implementation is pretty clean and already works with dask.

For now, I'm leaving this as private API, but let's expose it publicly in the future if we are happy with it. I would probably leave it as a Variable method since this is pretty low-level.
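For readers unfamiliar with the index-then-where() approach described above, here is a minimal sketch on plain NumPy arrays. The function name `getitem_with_mask` and its dtype handling are assumptions made for illustration; this is not the code in the PR, only the same idea in a few lines.

```python
import numpy as np


def getitem_with_mask(data, indexer, fill_value=np.nan):
    """Hypothetical NumPy-only stand-in for the behaviour shown in the example."""
    data = np.asarray(data)
    indexer = np.asarray(indexer)
    # Positions marked with -1 are the ones to mask.
    mask = indexer == -1
    # Index first: -1 wraps around to the last element, so the masked
    # positions still do some wasted indexing work before being replaced.
    taken = data[indexer]
    # Promote the dtype so that fill_value fits (e.g. int -> float for NaN).
    dtype = np.result_type(taken.dtype, np.asarray(fill_value).dtype)
    return np.where(mask, fill_value, taken.astype(dtype))


masked_1d = getitem_with_mask([1, 2, 3], [0, 1, 2, -1])
masked_2d = getitem_with_mask([1, 2, 3], [[0, -1], [-1, 1]], fill_value=-99)
```

The result takes the shape of the indexer, which is why a 2D indexer over 1D data gives a 2D result.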
It is a great idea! I will look inside the code later, but I feel this should be exposed to the public.
Agreed.
I pushed another commit (mostly but not entirely working) to port

To get a sense of how this affects performance, I made a small benchmarking script with our tutorial dataset:

    import xarray
    import numpy as np

    ds_numpy = xarray.tutorial.load_dataset('air_temperature').load()
    ds_chunked = ds_numpy.chunk({'time': 100})

    lat = np.linspace(ds_numpy.lat.min(), ds_numpy.lat.max(), num=100)
    lon = np.linspace(ds_numpy.lon.min(), ds_numpy.lon.max(), num=100)

    def do_reindex(ds):
        return ds.reindex(lat=lat, lon=lon, method='nearest', tolerance=0.5)

    %timeit do_reindex(ds_numpy)
    %timeit do_reindex(ds_chunked)

    result = do_reindex(ds_chunked)
    %timeit result.compute()

Our tutorial dataset is pretty small, but it can still give a flavor of how this scales. I chose new chunks intentionally with a small tolerance to create lots of empty chunks to mask:
Here are the benchmarking results: Before:
After:
So NumPy is somewhat slower (about 2.5x), but reindexing with dask is 75x faster! It even shows some ability to parallelize better than pure NumPy. This is encouraging. We should try to close the performance gap with NumPy (it was cleverly optimized before to use minimal copies of the data), but the existing reindex code with dask when doing masking is so slow that it is almost unusable.
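To illustrate why a small tolerance produces many masked values in a nearest-neighbour reindex, here is a small sketch using pandas.Index.get_indexer() directly. The toy index, target and data values are made up for illustration and are not taken from the benchmark.

```python
import numpy as np
import pandas as pd

# Toy coordinate and new target positions (illustrative values only).
index = pd.Index([0.0, 1.0, 2.0])
target = np.array([0.1, 1.4, 5.0])

# Targets with no label within the tolerance are reported as -1.
positions = index.get_indexer(target, method='nearest', tolerance=0.5)

# The -1 entries are exactly the positions to mask after indexing.
data = np.array([10.0, 20.0, 30.0])
values = np.where(positions == -1, np.nan, data[positions])
```

Here `positions` is `[0, 1, -1]`: the last target is 3.0 away from its nearest label, which exceeds the tolerance of 0.5, so it ends up as NaN in `values`.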
One comment :)
elif isinstance(indexer, VectorizedIndexer):
    key = indexer.tuple
    base_mask = _masked_result_drop_slice(key, chunks_hint)
Can we take care of more advanced indexers, such as a 2D array or multiple arrays that need to be broadcast against each other?
We may leave it for the future, but I think this PR would be a good place to add the full feature.
I think this already works with advanced indexers, though likely I need more test coverage :)
I misunderstood. Yes, this already seems to work.
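As a rough picture of what masking with several broadcast index arrays can look like, here is a NumPy-only sketch. Combining the per-array masks with a logical OR is an assumption made for illustration; the PR's actual mask construction may differ.

```python
import numpy as np

data = np.arange(12).reshape(3, 4)

# Two index arrays that broadcast against each other to shape (2, 3).
rows = np.array([[0], [-1]])
cols = np.array([[1, -1, 3]])
rows_b, cols_b = np.broadcast_arrays(rows, cols)

# Assumption: a position is masked if any of its indexers is -1.
mask = (rows_b == -1) | (cols_b == -1)

# Pointwise (vectorized) indexing, then replace masked positions with NaN.
result = np.where(mask, np.nan, data[rows_b, cols_b])
```

Only the positions where both indexers are non-negative keep their data; the whole second row is masked because its row indexer is -1.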
a784c6c to 2fbdfcf
OK, I'm going to merge this (just the first commit), and leave the second part (actually changing reindex) to another PR.
Could you add more test coverage of the first commit?
Okay, I'll come up with a few more tests to make sure this maintains 100% coverage... Let me know if you have any ideas for other edge cases.
That looks fantastic @shoyer, looking forward to testing it :)
I pushed some additional tests, which turned up the fact that dask's vectorized indexing does not support negative indices (fixed by dask/dask#2967).
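For backends that reject negative indices in vectorized indexing, one possible workaround is to swap the -1 sentinel for a valid position before indexing and apply the mask afterwards. The sketch below uses plain NumPy and a hypothetical helper name; it is not what this PR or the dask fix does, only an illustration of the idea.

```python
import numpy as np


def take_without_negative_indices(data, indexer, fill_value=np.nan):
    """Hypothetical helper: masked take that never passes -1 to the indexing step."""
    data = np.asarray(data)
    indexer = np.asarray(indexer)
    mask = indexer == -1
    # Replace the sentinel with 0 (any valid position works, assuming data is
    # non-empty) so the indexing step only ever sees non-negative indices.
    safe_indexer = np.where(mask, 0, indexer)
    return np.where(mask, fill_value, data[safe_indexer])


out = take_without_negative_indices([10, 20, 30], [2, -1, 0])
```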
@fujiisoup could you kindly take another look?
Sorry for my late review. I will take a look later today.
This looks good to me. I think it is ready to expose
I decided to merge in the current state rather than let this get stale. We can add the public API later....
git diff upstream/master **/*py | flake8 --diff
cc @fujiisoup @mraspaud