ideas for daily time grid crunching #372


Draft · wants to merge 12 commits into base: master
Conversation

remicousin
Contributor

Namely, to do "seasonal" analysis, i.e. to apply the same analysis to a slice of daily data within each year, and to do "daily climatology" and map back from climatology to real time to, for instance, compute anomalies.

There are 2 functions and 2 tests. The tests aren't really tests but rather show how I envision the usage (or what would go inside higher-level functions that are easier for the user to call). seasonal_groups creates the groups for seasonal analysis, while dayofyear366 creates the groups for climatologies.

The first test function shows that one can apply different functions, one after the other, to seasons only (of course, no further seasonal analysis can be done after a reducing function is applied; also, one doesn't have to apply a reducing function at all, in which case the end result is a normal daily variable, just with data points only at days within the seasons). Here seasonal_groups ensures that we keep only complete seasons, but I think it should also keep truncated seasons at the extremities and let the user slice accordingly if they don't want incomplete seasons. I labeled the groups with the first day of the season, but we should also keep the last day. Or somehow keep the "groups" attribute, i.e. a dictionary of lists showing all the points associated with each label.

The climato/ano case... I don't have much to comment on. Should we be given a 365-day climatology, it would be easy to project it to a 366-day one such as the one I created, filling Feb 29 with something, and give the user a few options for what to fill it with.
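A minimal pure-Python sketch of that 365-to-366 projection (the function name and fill options are hypothetical, not part of the PR):

```python
# Hypothetical sketch: project a 365-day climatology onto a 366-day
# year by inserting a value for Feb 29. With 0-based day-of-year
# indexing (Jan 1 = 0), Feb 28 is index 58 and Feb 29 would be 59.
def to_366(clim365, fill="interp"):
    """clim365: sequence of 365 values, one per day of a non-leap year."""
    feb28, mar1 = clim365[58], clim365[59]
    if fill == "interp":
        feb29 = (feb28 + mar1) / 2   # average of the two neighbors
    elif fill == "feb28":
        feb29 = feb28                # repeat the previous day
    else:
        raise ValueError(f"unknown fill option: {fill}")
    return list(clim365[:59]) + [feb29] + list(clim365[59:])
```

Other fill options (e.g. a missing-value marker) would slot in the same way.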

Also, with this setup, we need not create functions for each seasonal analysis (e.g. seasonal_sum, seasonal_onset) or climato/ano analysis (e.g. clim: mean, stddev...; ano: subtract, divide...), but can describe the process in documentation, or make functions calling functions (this time allowing users to see them):
seasonal(daily_data, start_day, start_month, end_day, end_month, [func1, func2...])
climato(daily_data, [func1, func2...])
ano(daily_data, [func1, func2...])
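To make the composition idea concrete, here is a pure-Python stand-in for the proposed seasonal() wrapper: label each date with the first day of the season it falls in, group by that label, then apply the functions in order. All names and signatures are illustrative, not the PR's actual API, and dates are plain datetime.date objects rather than an xarray T coordinate (a season ending 29 Feb would need extra care in non-leap years).

```python
from datetime import date, timedelta

def season_start(d, start_day, start_month, end_day, end_month):
    """Return the first day of the season containing d, or None if out of season."""
    for year in (d.year - 1, d.year):          # season may start the previous year
        start = date(year, start_month, start_day)
        end = date(year if end_month >= start_month else year + 1,
                   end_month, end_day)
        if start <= d <= end:
            return start
    return None

def seasonal(daily_data, start_day, start_month, end_day, end_month, funcs):
    """daily_data: dict mapping date -> value. Group in-season values and reduce."""
    groups = {}
    for d, v in daily_data.items():
        label = season_start(d, start_day, start_month, end_day, end_month)
        if label is not None:                  # drop out-of-season points
            groups.setdefault(label, []).append(v)
    out = {}
    for label, values in groups.items():
        for f in funcs:                        # apply func1, func2, ... in order
            values = f(values)
        out[label] = values
    return out
```

For instance, seasonal(data, 1, 1, 31, 3, [sum]) would return one JFM total per year, keyed by each season's first day.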

(Also, this is not necessarily meant to merge: this is a playground... I thought there was a label or something for that, but now I can't see it.)

@remicousin remicousin requested a review from aaron-kaplan August 1, 2023 18:19
@remicousin remicousin self-assigned this Aug 1, 2023
@aaron-kaplan aaron-kaplan marked this pull request as draft August 17, 2023 20:37
Collaborator

@aaron-kaplan aaron-kaplan left a comment

The overall approach of "label the data, group by labels, aggregate over groups" feels good to me.

I'm not so confident about dayofyear366, because of the confusion around leap days. I've thought about that problem enough to know that thinking about it more probably isn't going to help; one just has to pick one of the several imperfect solutions available. I'm curious why you felt you needed to implement it now. It isn't relevant to the onset/cessation work as far as I can see.

enacts/calc.py Outdated
if end_edges[0] < start_edges[0]:
end_edges = end_edges[1:]
if start_edges[-1] > end_edges[-1]:
start_edges = start_edges[:-2]
Collaborator

I think you're assuming that the T coordinate is dense, i.e. contains every day between its first value and its last. What if it's missing days? Specifically, what if it's missing the season start or end day in some years? I think we should just construct the edges arrays from scratch rather than extracting them from T with sel_day_and_month.
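A sketch of that suggestion in plain Python: build the season edges directly from the calendar for every year the data could cover, instead of selecting them out of the (possibly sparse) T coordinate. The function name and shape are illustrative.

```python
from datetime import date

def season_edges(first, last, start_day, start_month, end_day, end_month):
    """All (start, end) date pairs whose season overlaps [first, last]."""
    edges = []
    for year in range(first.year - 1, last.year + 1):
        start = date(year, start_month, start_day)
        # If the end day-of-year precedes the start, the season wraps into the next year.
        end_year = year if (end_month, end_day) >= (start_month, start_day) else year + 1
        end = date(end_year, end_month, end_day)
        if end >= first and start <= last:     # keep seasons that overlap the data
            edges.append((start, end))
    return edges
```

Because the edges come from the calendar rather than from T, a T coordinate missing the season's start or end day in some year no longer loses that season.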

Contributor Author

I pushed a proposed solution to this. Have a look.

Collaborator

I think you've fixed the problem. Not sure it's the simplest solution; what if we just generated a dense T coordinate, created start and end edges based on that, and then selected from the dense coordinate using the indices of the sparse one? Implementation detail and thus of secondary concern.
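A minimal pure-Python sketch of that "dense then reselect" idea (illustrative names; datetime.date stands in for the T coordinate): compute labels on a dense daily axis, then keep only the values at the observed, possibly sparse, dates.

```python
from datetime import date, timedelta

def dense_labels(sparse_dates, label_fn):
    """Label every day between first and last observation, then re-select
    only the observed dates, so gaps in the data cannot drop a label."""
    first, last = min(sparse_dates), max(sparse_dates)
    dense = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    labels = {d: label_fn(d) for d in dense}      # labels on the dense axis
    return [labels[d] for d in sparse_dates]      # re-select sparse positions
```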

Contributor Author

just pushed something else.

Collaborator

The implementation still seems unnecessarily complicated, but that's something to work together on interactively rather than via PR comments, and doesn't need to block progress. What's most important is the interface and the tests.

Regarding the interface, either assign_season_coords should be renamed select_season (select_and_label_season?), or it should only label seasons and not drop data outside of the season (set season start and end to NaT outside the season).

Regarding tests, here are some properties of the function that could be tested:

  • Behavior when start of the data is before the season, during the season, after the season. Likewise at the end of the data.
  • Confirm that all the seasons in the range of the data have been labeled/retained. If we mistakenly considered all of the data points to be out of season, the assertion you're currently using would pass.
  • Confirm that season ends are all the right number of days after season starts, i.e. that we haven't accidentally created seasons that span multiple years.
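The third property can be checked mechanically. A sketch, where the starts/ends lists stand in for the season labels the function would return (a 92-day MAM season, so an inclusive start-to-end span of 91 days):

```python
from datetime import date

# If a season accidentally spanned multiple years, its end would fall
# far more than 91 days after its start and this check would fail.
starts = [date(y, 3, 1) for y in (2000, 2001, 2002)]
ends = [date(y, 5, 31) for y in (2000, 2001, 2002)]
for s, e in zip(starts, ends):
    assert (e - s).days == 91, f"season starting {s} spans {(e - s).days + 1} days"
```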

Contributor Author

I think everything mentioned here has been addressed, except:

"Confirm that all the seasons in the range of the data have been labeled/retained. If we mistakenly considered all of the data points to be out of season, the assertion you're currently using would pass."

You're saying that all my tests are confirming that what I retained is within their appropriate seasons, but I am not testing that I did retain all that I should have?

Collaborator

Yes, that seems to be what I was saying.

Contributor Author

I can't wrap my head around what the test(s) could be...

Collaborator

@aaron-kaplan aaron-kaplan Oct 25, 2023

Construct the expected T coordinate, assert that the T coordinate returned by the function is identical to that?
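A sketch of what such a test could look like in plain Python: construct the expected T coordinate explicitly and compare. Here result_T stands in for what the labelling function would return for MAM 2000 (datetime.date stands in for the actual T coordinate values).

```python
from datetime import date, timedelta

# Expected: every day of March through May 2000 (92 days).
expected_T = [date(2000, 3, 1) + timedelta(days=i) for i in range(92)]

# Stand-in for the function's output: in-season days selected from a
# dense year of daily dates (2000 is a leap year, hence 366 days).
result_T = [d for d in (date(2000, 1, 1) + timedelta(days=i) for i in range(366))
            if 3 <= d.month <= 5]

assert result_T == expected_T   # would catch both dropped and extra days
```

Unlike an "everything retained is in season" assertion, this fails if any in-season day was mistakenly dropped.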

Contributor Author

I pushed something to that effect.

@remicousin
Contributor Author

The overall approach of "label the data, group by labels, aggregate over groups" feels good to me.

I'm not so confident about dayofyear366, because of the confusion around leap days. I've thought about that problem enough to know that thinking about it more probably isn't going to help; one just has to pick one of the several imperfect solutions available. I'm curious why you felt you needed to implement it now. It isn't relevant to the onset/cessation work as far as I can see.

I did it then mostly because I had an epiphany and wanted to... strike while the iron is hot. I do want to make a daily climatology for the water balance work, so there is a little bit of motivation from there. If we validate that route, it will also impact the onset/cessation work, to compute them for multiple years. This is currently done by our seasonal_ functions that rely on dailytobegroupedby (which you didn't like), which this work is heavily inspired by. It really is the same thing, only written (API?) differently.

@remicousin
Copy link
Contributor Author

@aaron-kaplan I am going to continue our conversation here. First, I can split the seasonal work from the climato/ano work since they don't share code and since the seasonal thing is more pressing now.

I realize that there is an additional difficulty with the water balance application of seasonal grouping, which is that water balance accepts multiple inputs and returns multiple outputs. It turns out that there is a Dataset.groupby, but that means I also need to make the water balance functions accept their inputs as one Dataset and return another Dataset, so that I can apply the labelling-dropping function to it, then the grouping function (groupby), then the operator on the groups...

@remicousin remicousin requested a review from xchourio August 30, 2023 13:36
@remicousin
Contributor Author

Adding Xandre more systematically now on this "core" work... even if only as an FYI. This now concerns the seasonal case only. I moved the clim/ano part into another PR.

@remicousin
Contributor Author

Revisited this based on Tuesday's conversation. Need to work on the docs, in particular adding usage examples. Tests are now real tests. Want to add a test for sparse data.

Collaborator

@aaron-kaplan aaron-kaplan left a comment

My questions about what the function does are important; questions about implementation details are secondary and don't necessarily need to be resolved before we can move forward.

I think this retains partial seasons at the start and end. Is that the right (most useful) behavior? I think I've observed that Ingrid drops partial seasons. But probably in the seasonalAverage function, which does more than this function does.

enacts/calc.py Outdated
@@ -619,6 +620,112 @@ def daily_tobegroupedby_season(
return daily_tobegroupedby_season


def assign_season_coords(
daily_data,
Collaborator

Seasonal analysis is relevant to monthly and dekadal data too. Should those have their own functions, or should we leave room to generalize this one?

Contributor Author

Called it time_series. Indeed, as I look at the code, it doesn't look like it would care whether the data is dekadal or monthly. It's not like xarray understands that much about things being daily anyway.
We might want to get to the bottom of this, then, and make start_/end_day optional, because it would be awkward to ask for them in the "monthly" case.

@remicousin remicousin mentioned this pull request Sep 12, 2023
@remicousin
Contributor Author

@aaron-kaplan can we move this along?

end_month = 3
day_offset = -1
else:
day_offset = 0
Collaborator

Maybe end_month/end_day should always represent the first day that's not included in the season? I.e. day_offset is always -1; then the DJF season would be represented with end_month = 3, end_day = 1, and JFM with end_month = 4, end_day = 1. Advantages: no special case, so less likelihood of a bug in the special case that we don't notice because we failed to test it; corresponds to how arrays are sliced in Python: '012345'[0:4] == '0123', not '01234'. Disadvantage: for some reason xarray slicing, unlike Python slicing, uses intervals that are closed on both ends, so that could be confusing.
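An illustration of this half-open convention in plain Python (the function is a hypothetical sketch, not the PR's code): the end (month, day) names the first day not in the season, so DJF ends at (3, 1) and leap days fall out naturally with no special case.

```python
from datetime import date

def in_season(d, start, end):
    """start/end are (month, day) pairs; closed on the left, open on the right."""
    key = (d.month, d.day)
    if start <= end:
        return start <= key < end
    return key >= start or key < end          # season wraps the year boundary

assert in_season(date(2001, 2, 28), (12, 1), (3, 1))     # DJF includes Feb 28
assert in_season(date(2000, 2, 29), (12, 1), (3, 1))     # ...and Feb 29 in leap years
assert not in_season(date(2001, 3, 1), (12, 1), (3, 1))  # Mar 1 excluded
```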

Collaborator

It's late in the review cycle for me to be introducing something like this. Feel free to ignore if you don't like the suggestion.

Contributor Author

I can see arguments for (or against) either way.

I am not sure there won't be any special case. If the user wanted a season ending on 28 Feb (included), would they put 29 Feb or 1 Mar? If they put 29 Feb, they would miss 28 Feb in 3 out of 4 years; if they put 1 Mar, they would include 29 Feb every 4 years. At least if using sel_day_and_month...

If we envision the function working also for the monthly case, then start_/end_day should be optional inputs, and having their default value at 1 would be straightforward with an open interval end. If we keep the intervals closed on both ends, we should probably default them to None, have start be 1, and have end be 1 too with an offset of -1, so that when users don't enter days, they get full months. So it's more gymnastics (but not crazy) in that case.

Then, as we try to build on top of xarray (well, at least that is what I am trying to do), I feel we should try to adopt xarray conventions, and thus keep closed intervals... That's not a very strong argument.

Collaborator

If they put 29 Feb, they would miss 28 Feb in 3 out of 4 years

I don't understand. Why would they miss Feb 28?

Xarray's choice to define slices as closed on both ends doesn't make sense to me. You can't partition a range of real values into bins using intervals that are closed on both ends. Indeed, groupby_bins defaults to intervals that are open on the left and closed on the right, which is doubly weird, because it's consistent with neither the behavior of .sel nor the more general convention, which (as I see it) is for intervals to be closed on the left and open on the right.
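The two bin conventions under discussion can be contrasted without xarray or pandas; these helpers are illustrative stand-ins, mimicking the behaviors described above (groupby_bins/pandas.cut default to open-left/closed-right bins, versus the more common closed-left/open-right convention):

```python
def bin_right_closed(x, edges):
    """cut-style binning: x falls in (edges[i], edges[i+1]]."""
    for i in range(len(edges) - 1):
        if edges[i] < x <= edges[i + 1]:
            return i
    return None          # x is outside all bins

def bin_left_closed(x, edges):
    """Conventional half-open bins: x falls in [edges[i], edges[i+1])."""
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    return None

edges = [0, 10, 20]
assert bin_right_closed(10, edges) == 0    # 10 lands in the first bin
assert bin_left_closed(10, edges) == 1     # 10 lands in the second bin
assert bin_right_closed(0, edges) is None  # 0 falls off the open left edge
```

The same value can land in different bins, or in no bin at all, depending solely on the edge convention, which is why the choice matters for season boundaries.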

Contributor Author

They would miss it if we keep on using sel_day_and_month.

Yes, I noticed the seemingly strange behavior of groupby_bins with respect to interval edge inclusivity. They just reproduced the pandas.cut behavior, though. And maybe it is so because this is more about categorizing than slicing? Maybe we should think it through via the resample prism rather than the groupby one?
