ideas for daily time grid crunching #372


Draft · wants to merge 12 commits into base: master
Conversation

remicousin
Contributor

Namely, to do "seasonal" analysis, i.e. to apply the same analysis to a slice of daily data within each year, and to do "daily climatology" and map back from climatology to real time to, for instance, compute anomalies.

There are 2 functions and 2 tests. The tests aren't really tests but rather show how I envision the usage (or what would go inside higher-level functions that are easier for the user to call). seasonal_groups creates the groups for seasonal analysis, while dayofyear366 creates the groups for climatologies.

The first test function shows that one can apply different functions, one after the other, to seasons only (of course, no further seasonal analysis can be done after a reducing function is applied; also, one doesn't have to apply a reducing function at all, in which case the end result is a normal daily variable, just with data points only at days within the seasons). Here seasonal_groups ensures that we keep only complete seasons, but I think it should also keep truncated seasons at the extremities and let the user slice accordingly if they don't want incomplete seasons. I labeled the groups with the first day of the season, but we should also keep the last day. Or somehow keep the "groups" attribute, i.e. a dictionary of lists showing all the points associated with each label.

The climato/ano case... I don't have much to comment on. Should we be given a 365-day climatology, it would be easy to project it to a 366-day one such as the one I created, filling Feb 29 with something, and give the user a few options for what to fill it with.
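A minimal pure-Python sketch of that 365-to-366 projection (the function name and fill options are hypothetical, not part of the PR):

```python
# Hypothetical sketch: project a 365-day climatology onto a 366-day
# year by inserting a value for Feb 29. With 0-based day-of-year
# indexing (Jan 1 = 0), Feb 28 is index 58 and Feb 29 would be 59.
def to_366(clim365, fill="interp"):
    """clim365: sequence of 365 values, one per day of a non-leap year."""
    feb28, mar1 = clim365[58], clim365[59]
    if fill == "interp":
        feb29 = (feb28 + mar1) / 2   # average of the two neighbors
    elif fill == "feb28":
        feb29 = feb28                # repeat the previous day
    else:
        raise ValueError(f"unknown fill option: {fill}")
    return list(clim365[:59]) + [feb29] + list(clim365[59:])
```

Other fill options (e.g. a missing-value marker) would slot in the same way.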

Also, with this setup, we need not create functions for each seasonal analysis (e.g. seasonal_sum, seasonal_onset) or climato/ano analysis (e.g. clim: mean, stddev...; ano: subtract, divide...), but can describe the process in documentation, or make functions calling functions (this time allowing users to see them):
seasonal(daily_data, start_day, start_month, end_day, end_month, [func1, func2...])
climato(daily_data, [func1, func2...])
ano(daily_data, [func1, func2...])
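To make the composition idea concrete, here is a pure-Python stand-in for the proposed seasonal() wrapper: label each date with the first day of the season it falls in, group by that label, then apply the functions in order. All names and signatures are illustrative, not the PR's actual API, and dates are plain datetime.date objects rather than an xarray T coordinate (a season ending 29 Feb would need extra care in non-leap years).

```python
from datetime import date, timedelta

def season_start(d, start_day, start_month, end_day, end_month):
    """Return the first day of the season containing d, or None if out of season."""
    for year in (d.year - 1, d.year):          # season may start the previous year
        start = date(year, start_month, start_day)
        end = date(year if end_month >= start_month else year + 1,
                   end_month, end_day)
        if start <= d <= end:
            return start
    return None

def seasonal(daily_data, start_day, start_month, end_day, end_month, funcs):
    """daily_data: dict mapping date -> value. Group in-season values and reduce."""
    groups = {}
    for d, v in daily_data.items():
        label = season_start(d, start_day, start_month, end_day, end_month)
        if label is not None:                  # drop out-of-season points
            groups.setdefault(label, []).append(v)
    out = {}
    for label, values in groups.items():
        for f in funcs:                        # apply func1, func2, ... in order
            values = f(values)
        out[label] = values
    return out
```

For instance, seasonal(data, 1, 1, 31, 3, [sum]) would return one JFM total per year, keyed by each season's first day.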

(Also, this is not necessarily meant to merge: this is a playground... I thought there was a label or something for that, but now I can't see it.)

@remicousin remicousin requested a review from aaron-kaplan August 1, 2023 18:19
@remicousin remicousin self-assigned this Aug 1, 2023
@aaron-kaplan aaron-kaplan marked this pull request as draft August 17, 2023 20:37
Collaborator

@aaron-kaplan aaron-kaplan left a comment

The overall approach of "label the data, group by labels, aggregate over groups" feels good to me.

I'm not so confident about dayofyear366, because of the confusion around leap days. I've thought about that problem enough to know that thinking about it more probably isn't going to help; one just has to pick one of the several imperfect solutions available. I'm curious why you felt you needed to implement it now. It isn't relevant to the onset/cessation work as far as I can see.

enacts/calc.py Outdated
if end_edges[0] < start_edges[0]:
end_edges = end_edges[1:]
if start_edges[-1] > end_edges[-1]:
start_edges = start_edges[:-2]
Collaborator

I think you're assuming that the T coordinate is dense, i.e. contains every day between its first value and its last. What if it's missing days? Specifically, what if it's missing the season start or end day in some years? I think we should just construct the edges arrays from scratch rather than extracting them from T with sel_day_and_month.
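A sketch of that suggestion in plain Python: build the season edges directly from the calendar for every year the data could cover, instead of selecting them out of the (possibly sparse) T coordinate. The function name and shape are illustrative.

```python
from datetime import date

def season_edges(first, last, start_day, start_month, end_day, end_month):
    """All (start, end) date pairs whose season overlaps [first, last]."""
    edges = []
    for year in range(first.year - 1, last.year + 1):
        start = date(year, start_month, start_day)
        # If the end day-of-year precedes the start, the season wraps into the next year.
        end_year = year if (end_month, end_day) >= (start_month, start_day) else year + 1
        end = date(end_year, end_month, end_day)
        if end >= first and start <= last:     # keep seasons that overlap the data
            edges.append((start, end))
    return edges
```

Because the edges come from the calendar rather than from T, a T coordinate missing the season's start or end day in some year no longer loses that season.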

Contributor Author

I pushed a proposed solution to this. Have a look.

Collaborator

I think you've fixed the problem. Not sure it's the simplest solution; what if we just generated a dense T coordinate, created start and end edges based on that, and then selected from the dense coordinate using the indices of the sparse one? Implementation detail and thus of secondary concern.
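A minimal pure-Python sketch of that "dense then reselect" idea (illustrative names; datetime.date stands in for the T coordinate): compute labels on a dense daily axis, then keep only the values at the observed, possibly sparse, dates.

```python
from datetime import date, timedelta

def dense_labels(sparse_dates, label_fn):
    """Label every day between first and last observation, then re-select
    only the observed dates, so gaps in the data cannot drop a label."""
    first, last = min(sparse_dates), max(sparse_dates)
    dense = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    labels = {d: label_fn(d) for d in dense}      # labels on the dense axis
    return [labels[d] for d in sparse_dates]      # re-select sparse positions
```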

Contributor Author

just pushed something else.

Collaborator

The implementation still seems unnecessarily complicated, but that's something to work together on interactively rather than via PR comments, and doesn't need to block progress. What's most important is the interface and the tests.

Regarding the interface, either assign_season_coords should be renamed select_season (select_and_label_season?), or it should only label seasons and not drop data outside of the season (set season start and end to NaT outside the season).

Regarding tests, here are some properties of the function that could be tested:

  • Behavior when start of the data is before the season, during the season, after the season. Likewise at the end of the data.
  • Confirm that all the seasons in the range of the data have been labeled/retained. If we mistakenly considered all of the data points to be out of season, the assertion you're currently using would pass.
  • Confirm that season ends are all the right number of days after season starts, i.e. that we haven't accidentally created seasons that span multiple years.
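The third property can be checked mechanically. A sketch, where the starts/ends lists stand in for the season labels the function would return (a 92-day MAM season, so an inclusive start-to-end span of 91 days):

```python
from datetime import date

# If a season accidentally spanned multiple years, its end would fall
# far more than 91 days after its start and this check would fail.
starts = [date(y, 3, 1) for y in (2000, 2001, 2002)]
ends = [date(y, 5, 31) for y in (2000, 2001, 2002)]
for s, e in zip(starts, ends):
    assert (e - s).days == 91, f"season starting {s} spans {(e - s).days + 1} days"
```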

Contributor Author

I think everything mentioned here has been addressed, except:

"Confirm that all the seasons in the range of the data have been labeled/retained. If we mistakenly considered all of the data points to be out of season, the assertion you're currently using would pass."

You're saying that all my tests are confirming that what I retained is within their appropriate seasons, but I am not testing that I did retain all that I should have?

Collaborator

Yes, that seems to be what I was saying.

Contributor Author

I can't wrap my head around what the test(s) could be...

Collaborator

@aaron-kaplan aaron-kaplan Oct 25, 2023

Construct the expected T coordinate, assert that the T coordinate returned by the function is identical to that?
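A sketch of what such a test could look like in plain Python: construct the expected T coordinate explicitly and compare. Here result_T stands in for what the labelling function would return for MAM 2000 (datetime.date stands in for the actual T coordinate values).

```python
from datetime import date, timedelta

# Expected: every day of March through May 2000 (92 days).
expected_T = [date(2000, 3, 1) + timedelta(days=i) for i in range(92)]

# Stand-in for the function's output: in-season days selected from a
# dense year of daily dates (2000 is a leap year, hence 366 days).
result_T = [d for d in (date(2000, 1, 1) + timedelta(days=i) for i in range(366))
            if 3 <= d.month <= 5]

assert result_T == expected_T   # would catch both dropped and extra days
```

Unlike an "everything retained is in season" assertion, this fails if any in-season day was mistakenly dropped.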

Contributor Author

I pushed something to that effect.

@remicousin
Contributor Author

The overall approach of "label the data, group by labels, aggregate over groups" feels good to me.

I'm not so confident about dayofyear366, because of the confusion around leap days. I've thought about that problem enough to know that thinking about it more probably isn't going to help; one just has to pick one of the several imperfect solutions available. I'm curious why you felt you needed to implement it now. It isn't relevant to the onset/cessation work as far as I can see.

I did it then mostly because I had an epiphany and wanted to... strike while the iron is hot. I do want to make a daily climatology for the water balance work, so there is a little bit of motivation from there. If we validate that route, it will also impact the onset/cessation work, to compute them for multiple years. This is currently done by our seasonal_ functions that rely on dailytobegroupedby (which you didn't like), which this work is heavily inspired by. It really is the same thing, only written (API?) differently.

@remicousin
Copy link
Contributor Author

@aaron-kaplan I am going to continue our conversation here. First, I can split the seasonal work from the climato/ano work since they don't share code and since the seasonal thing is more pressing now.

I realize that there is an additional difficulty with the water balance application of seasonal grouping, which is that water balance accepts multiple inputs and returns multiple outputs. It turns out that there is a Dataset.groupby, but that means I also need to make the water balance functions accept their inputs as one Dataset and return another Dataset, so that I can apply the labelling-dropping function to it, then the grouping function (groupby), then the operator on the groups...

@remicousin remicousin requested a review from xchourio August 30, 2023 13:36
@remicousin
Contributor Author

Adding Xandre more systematically now on this "core" work... even if only as an FYI. This now concerns the seasonal case only. I moved the clim/ano part into another PR.

@remicousin
Contributor Author

Revisited this based on Tuesday's conversation. Need to work on the docs, in particular adding usage examples. Tests are now real tests. Want to add a test for sparse data.

Collaborator

@aaron-kaplan aaron-kaplan left a comment

My questions about what the function does are important; questions about implementation details are secondary and don't necessarily need to be resolved before we can move forward.

I think this retains partial seasons at the start and end. Is that the right (most useful) behavior? I think I've observed that Ingrid drops partial seasons. But probably in the seasonalAverage function, which does more than this function does.

enacts/calc.py Outdated
@@ -619,6 +620,112 @@ def daily_tobegroupedby_season(
return daily_tobegroupedby_season


def assign_season_coords(
daily_data,
Collaborator

Seasonal analysis is relevant to monthly and dekadal data too. Should those have their own functions, or should we leave room to generalize this one?

Contributor Author

Called it time_series. Indeed, as I look at the code, it doesn't look like it would care whether the data is dekadal or monthly. It's not like xarray understands that much about things being daily anyway.
We might want to get to the bottom of this, then, and make start_/end_day optional, because it would be awkward to ask for them in the "monthly" case.

@remicousin remicousin mentioned this pull request Sep 12, 2023
@remicousin
Contributor Author

@aaron-kaplan can we move this along?

end_month = 3
day_offset = -1
else:
day_offset = 0
Collaborator

Maybe end_month/end_day should always represent the first day that's not included in the season? I.e. day_offset is always -1; then the DJF season would be represented with end_month = 3, end_day = 1, and JFM with end_month = 4, end_day = 1. Advantages: no special case, so less likelihood of a bug in the special case that we don't notice because we failed to test it; corresponds to how arrays are sliced in Python: '012345'[0:4] == '0123', not '01234'. Disadvantage: for some reason xarray slicing, unlike Python slicing, uses intervals that are closed on both ends, so that could be confusing.
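An illustration of this half-open convention in plain Python (the function is a hypothetical sketch, not the PR's code): the end (month, day) names the first day not in the season, so DJF ends at (3, 1) and leap days fall out naturally with no special case.

```python
from datetime import date

def in_season(d, start, end):
    """start/end are (month, day) pairs; closed on the left, open on the right."""
    key = (d.month, d.day)
    if start <= end:
        return start <= key < end
    return key >= start or key < end          # season wraps the year boundary

assert in_season(date(2001, 2, 28), (12, 1), (3, 1))     # DJF includes Feb 28
assert in_season(date(2000, 2, 29), (12, 1), (3, 1))     # ...and Feb 29 in leap years
assert not in_season(date(2001, 3, 1), (12, 1), (3, 1))  # Mar 1 excluded
```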

Collaborator

It's late in the review cycle for me to be introducing something like this. Feel free to ignore if you don't like the suggestion.

Contributor Author

I can see arguments for (or against) either way.

I am not sure there won't be any special case. If the user wanted a season ending on 28 Feb (included), would they put 29 Feb or 1 Mar? If they put 29 Feb, they would miss 28 Feb in 3 out of 4 years; if they put 1 Mar, they would include 29 Feb every 4 years. At least if using sel_day_and_month...

If we envision the function working also for the monthly case, then start_/end_day should be optional inputs, and having their default value at 1 would be straightforward with an open interval end. If we keep the intervals closed on both ends, we should probably default them to None, have start be 1, and have end be 1 too with an offset of -1, so that when users don't enter days, they get full months. So it's more gymnastics (but not crazy) in that case.

Then, as we try to build on top of xarray (well, at least that is what I am trying to do), I feel we should try to adopt xarray conventions, and thus keep closed intervals... That's not a very strong argument.

Collaborator

If they put 29 Feb, they would miss 28 Feb in 3 out of 4 years

I don't understand. Why would they miss Feb 28?

Xarray's choice to define slices as closed on both ends doesn't make sense to me. You can't partition a range of real values into bins using intervals that are closed on both ends. Indeed, groupby_bins defaults to intervals that are open on the left and closed on the right, which is doubly weird, because it's consistent with neither the behavior of .sel nor the more general convention, which (as I see it) is for intervals to be closed on the left and open on the right.
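The two bin conventions under discussion can be contrasted without xarray or pandas; these helpers are illustrative stand-ins, mimicking the behaviors described above (groupby_bins/pandas.cut default to open-left/closed-right bins, versus the more common closed-left/open-right convention):

```python
def bin_right_closed(x, edges):
    """cut-style binning: x falls in (edges[i], edges[i+1]]."""
    for i in range(len(edges) - 1):
        if edges[i] < x <= edges[i + 1]:
            return i
    return None          # x is outside all bins

def bin_left_closed(x, edges):
    """Conventional half-open bins: x falls in [edges[i], edges[i+1])."""
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    return None

edges = [0, 10, 20]
assert bin_right_closed(10, edges) == 0    # 10 lands in the first bin
assert bin_left_closed(10, edges) == 1     # 10 lands in the second bin
assert bin_right_closed(0, edges) is None  # 0 falls off the open left edge
```

The same value can land in different bins, or in no bin at all, depending solely on the edge convention, which is why the choice matters for season boundaries.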

Contributor Author

They would miss it if we keep on using sel_day_and_month.

Yes, I noticed the seemingly strange behavior of groupby_bins with respect to interval edge inclusivity. They just reproduced the pandas.cut behavior, though. And maybe it is so because this is more about categorizing than slicing? Maybe we should think it through via the resample prism rather than the groupby one?
