By switching to numpy_groupies we are vectorizing our groupby reductions. I think we can do the same for groupby's binary ops.

Here's an example array:
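The example array itself isn't shown above, so here is a stand-in with made-up sizes, just to make the snippets below concrete (the number of elements and number of groups are assumptions, not from the original):

import numpy as np
import xarray as xr

# Stand-in example: random values along "x", with an integer "labels"
# coordinate defining the groups. Sizes are arbitrary.
N = 1_000_000
da = xr.DataArray(
    np.random.randn(N),
    dims="x",
    coords={"labels": ("x", np.random.randint(0, 1000, size=N))},
)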
Consider this "anomaly" calculation, where the anomaly is defined relative to the group mean:
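The anom_current function used in the timings below isn't reproduced above; presumably it goes through xarray's groupby arithmetic, something like this (an assumed sketch, not necessarily the original code):

# Assumed definition of anom_current: subtract each group's mean using a
# binary op between the GroupBy object and the reduced means.
def anom_current(da):
    gb = da.groupby("labels")
    return gb - gb.mean()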
With this approach, we loop over each group and apply the binary operation:
xarray/core/computation.py, lines 502 to 525 (at a1635d3)
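Conceptually, that code path does something like the following (a simplified sketch, not the actual implementation behind the link above):

import xarray as xr

# Simplified sketch of the loop-over-groups binary op. It ignores the
# bookkeeping xarray does to restore the original element order.
def anom_loop(da):
    mean = da.groupby("labels").mean()
    pieces = [
        group - mean.sel(labels=label)
        for label, group in da.groupby("labels")
    ]
    return xr.concat(pieces, dim="x")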
This saves some memory, but becomes slow for a large number of groups.
We could instead do

def anom_vectorized(da):
    mean = da.groupby("labels").mean()
    mean_expanded = mean.sel(labels=da.labels)
    anom = da - mean_expanded
    return anom

Now we are faster, but we construct an extra array as big as the original array (I think this is an OK tradeoff).

%timeit anom_current(da)
# 1.4 s ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit anom_vectorized(da)
# 937 ms ± 5.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
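As a sanity check (using the da and imports from above), the two implementations should produce the same values:

# The vectorized version should agree with the loop-based one.
xr.testing.assert_allclose(anom_current(da), anom_vectorized(da))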
(I haven't experimented with dask yet, so the following is just a theory).
I think the real benefit comes with dask. With the current approach, depending on where the groups are located relative to the chunking, we could end up creating a lot of tiny chunks by splitting up existing chunks. With the vectorized approach we can do better.
Ideally we would reindex the "mean" dask array with a numpy-array-of-repeated-ints such that the chunking of mean_expanded exactly matches the chunking of da along the grouped dimension.
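As a rough illustration of that expansion step in dask terms (the array sizes, chunk sizes, and variable names here are made up):

import numpy as np
import dask.array as dask_array

# Hypothetical sizes, purely for illustration.
ngroups, npoints = 1_000, 1_000_000
mean = dask_array.ones((ngroups,), chunks=100)        # one value per group
codes = np.random.randint(0, ngroups, size=npoints)   # group index of each element

# Expand the per-group means back along the grouped dimension. Ideally we
# could ask for the result to be chunked exactly like da here.
mean_expanded = dask_array.take(mean, codes, axis=0)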
In practice, dask.array.take doesn't allow specifying "output chunks", so we'd end up chunking "mean_expanded" based on dask's automatic heuristics, and then rechunking again for the binary operation.

Thoughts?
cc @rabernat