Better rolling reductions #4915
Conversation
@@ -494,6 +527,14 @@ def _numpy_or_bottleneck_reduce(
                bottleneck_move_func, keep_attrs=keep_attrs, **kwargs
            )
        else:
            if fillna is not None:
                if fillna is dtypes.INF:
                    fillna = dtypes.get_pos_infinity(self.obj.dtype, max_for_int=True)
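For context, a minimal sketch (my own example, not the PR's actual code path) of the idea this hunk relies on: pre-filling missing values with +inf leaves no NaNs in the data, so they can never win a rolling minimum and the reduction can run on dense data.

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, np.nan, 3.0, 2.0, np.nan], dims="x")

# Fill gaps with +inf so they can never be the window minimum,
# then take the rolling min as usual.
filled_min = da.fillna(np.inf).rolling(x=3, min_periods=1).min()
print(filled_min.values)  # [1. 1. 1. 2. 2.]
```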
Useless since we always pad with NaN, which ends up promoting to float. We should add `fill_value` support to rolling.
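To illustrate the promotion being referred to (an assumed example, not code from this PR): constructing the rolling windows pads the edges with NaN, so integer input already comes back as float64 before any reduction is applied.

```python
import numpy as np
import xarray as xr

ints = xr.DataArray(np.arange(5), dims="x")
windows = ints.rolling(x=3).construct("window")  # edges padded with NaN
print(ints.dtype, "->", windows.dtype)           # e.g. int64 -> float64
```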
Clever, looks good.
I had another look and this looks ready (unless you'd like to also do mean).
I got mean to work. var is a little more involved, so I haven't done that yet.
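A rough sketch of what a sum/count-based rolling mean looks like, under the assumption that the new mean path follows the same fillna-then-reduce pattern as the other reductions here; the exact code in the PR may differ.

```python
import numpy as np
import xarray as xr

da = xr.DataArray([1.0, np.nan, 3.0, 5.0], dims="x")
roll = da.rolling(x=2, min_periods=1)

# Zero-fill the gaps, take a rolling sum, and divide by the count of
# valid points in each window.
manual_mean = da.fillna(0).rolling(x=2, min_periods=1).sum() / roll.count()
print(np.allclose(manual_mean, roll.mean()))  # True
```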
Very clever implementation, @dcherian!

> I got mean to work.

Nice ;)

> var is a little involved so haven't done that yet.

I think we can just leave the difficult ones with the TODO comment.
Thanks for the reviews @mathause and @fujiisoup
Has there been any progress on this for var/std?
We would welcome a PR. Looking at the existing implementation (Line 160 in 67ff171) would be a good place to start.
I think I may have found a way to make it more memory efficient, but I don't know enough about writing the sort of code that would be needed for a PR. I basically wrote out the calculation for variance, trying to only use the functions that have already been optimised. Derived from expanding the variance formula: Var(x) = (Σx² − 2·x̄·Σx + n·x̄²) / n. I coded this up and demonstrate that it uses approximately 10% of the memory of the current implementation:

```python
%load_ext memory_profiler
import numpy as np
import xarray as xr
temp = xr.DataArray(np.random.randint(0, 10, (5000, 500)), dims=("x", "y"))
def new_var(da, x=10, y=20):
    # Define the re-used parts
    roll = da.rolling(x=x, y=y)
    mean = roll.mean()
    count = roll.count()
    # First term: rolling sum of squared values
    term1 = (da**2).rolling(x=x, y=y).sum()
    # Second term: cross-term sum
    term2 = -2 * mean * roll.sum()
    # Third term: "sum" of squared means
    term3 = count * mean**2
    # Combine into the variance
    var = (term1 + term2 + term3) / count
    return var
def old_var(da, x=10, y=20):
    roll = da.rolling(x=x, y=y)
    var = roll.var()
    return var
%memit new_var(temp)
%memit old_var(temp)
```

I wanted to double check that the calculation was working correctly:

```python
var_n = new_var(temp)  # new implementation
var_o = old_var(temp)  # existing rolling .var()
print((var_o.where(~np.isnan(var_o), 0) == var_n.where(~np.isnan(var_n), 0)).all().values)
print(np.allclose(var_o, var_n, equal_nan = True))
```

I think the difference here is just due to floating point errors, but maybe someone who knows how to check that in more detail could have a look. The standard deviation can be trivially implemented from this if the approach works (see the sketch below).
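A sketch of that trivially-derived standard deviation: `new_std` is just a hypothetical helper name, reusing `new_var` and `temp` from the snippet above.

```python
import numpy as np

def new_std(da, x=10, y=20):
    # Standard deviation as the square root of the memory-saving variance above
    return np.sqrt(new_var(da, x=x, y=y))

std_n = new_std(temp)
```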
Can you copy your comment to #4325 please?
- Implements most of #4325 (comment)
- Passes `pre-commit run --all-files`
- User visible changes documented in `whats-new.rst`