Skip to content

Resync resampler on system time changes #802

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

matthias-wende-frequenz
Copy link
Contributor

The resampler was initializing an internal time on initialization by syncing itself to the system clock.

This sync was never redone and thus lead to out-of-sync issues when the system clock changed, which could cause that the resampling window was empty even when samples, with the new system time, arrived.

@github-actions github-actions bot added the part:data-pipeline Affects the data pipeline label Dec 1, 2023
@matthias-wende-frequenz
Copy link
Contributor Author

We can discuss getting pulling this in and add a test shortly after in a separate PR, to speed up the patch release. What do you think?

@matthias-wende-frequenz matthias-wende-frequenz added the priority:URGENT Address this immediately or the sky will fall label Dec 1, 2023
@llucax
Copy link
Contributor

llucax commented Dec 1, 2023

We can discuss getting pulling this in and add a test shortly after in a separate PR, to speed up the patch release. What do you think?

What's the plan here? If the plan is that this is tested in production (or somewhere else), any reason not to just temporarily point the SDK dependency to your fork's branch (https://github.com/matthias-wende-frequenz/frequenz-sdk-python/tree/fix_resampling)?

I'd be OK to merge first and do the test afterwards if there is no other way around, but if we can do whatever we need without accepting PRs without tests, I think it makes more sense.

@matthias-wende-frequenz
Copy link
Contributor Author

matthias-wende-frequenz commented Dec 1, 2023

I'd be OK to merge first and do the test afterwards if there is no other way around, but if we can do whatever we need without accepting PRs without tests, I think it makes more sense.

I was thinking that we can unblock the fix. But I feel it's worth adding the tests before merging. Also the tests are almost ready. Will add them to this PR soon.

@github-actions github-actions bot added the part:docs Affects the documentation label Dec 2, 2023
@github-actions github-actions bot added the part:tests Affects the unit, integration and performance (benchmarks) tests label Dec 4, 2023
@matthias-wende-frequenz matthias-wende-frequenz force-pushed the fix_resampling branch 4 times, most recently from d64c7c2 to f0ecc06 Compare December 4, 2023 21:47
@matthias-wende-frequenz matthias-wende-frequenz marked this pull request as ready for review December 5, 2023 12:08
@matthias-wende-frequenz matthias-wende-frequenz requested a review from a team as a code owner December 5, 2023 12:08
Signed-off-by: Matthias Wende <[email protected]>
Adds a boolean parameter if an extra period should be added
to the calculated window end or not.
Signed-off-by: Matthias Wende <[email protected]>
Signed-off-by: Matthias Wende <[email protected]>
Signed-off-by: Matthias Wende <[email protected]>
@@ -457,6 +449,34 @@ def remove_timeseries(self, source: Source) -> bool:
return False
return True

def _sync_timer(self, extra_period: bool = True) -> datetime:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly, but I would either remove this extra period, as is it only a startup glitch we are trying to fix (I guess it made sense in the original code because it was free), or to switch the default, as the normal situation seems to be not adding an extra period.

# new samples aligned to the new system time have been received.
# Thus we resync `self._window_end` to the new system time in case
# it drifted more then one resampling period away from the system time.
if abs(self._window_end - now) - drift > self._config.resampling_period:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly this is still not enough. Both drifts need to be handled independently, the mixing of both could end up hiding issues. For example:

monotonic clock
0         1         2         3         4         5         6 
|---------|---------|---------|---------|---------|---------|---------> time
o         o         o         o         o         o
          t1        t2        t3        t4        t5                    expected ticks
            T1       T2       T3                     T4                 observed ticks
O           *        *        *                      *
|------------|------------|------------|------------|------------|-------> time
0            1            2            3            4            5
wall clock

resampling_period = 1.0, and the wall clock has an accumulating drift of 0.3s per second.

T1:
* now = ~0.9
* window_end = 1.0
* drift = 0.2
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             1.0 - 0.9) - 0.2   > 1.0  
                          0.1 - 0.2   > 1.0
                               -0.1   > 1.0 => False

T2:
* now = ~1.7
* window_end = 2.0
* drift = 0.1
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             2.0 - 1.7) - 0.1   > 1.0  
                          0.3 - 0.1   > 1.0
                                0.2   > 1.0 => False

T3:
* now = ~2.4
* window_end = 3.0
* drift = 0
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             3.0 - 2.4) - 0.0   > 1.0  
                          0.6 - 0.0   > 1.0
                                0.6   > 1.0 => False

T4:
* now = ~4.1
* window_end = 4.0
* drift = 1.3
* abs(self._window_end - now) - drift > self._config.resampling_period
  abs(             4.0 - 4.1) - 1.3   > 1.0  
                          0.1 - 1.3   > 1.0
                               -1.2   > 1.0 => False

I don't know, from this maybe it's enough to use abs(abs(we - now) - abs(drift)) > period, I have to think about it a bit more. But before doing that, now I really wonder why we are doing all this... 🤔

Don't we just want to always sync to the wall clock, so effectively make the resampler count periods in terms of the wall clock instead of the monotonic clock?

O maybe we just want to emit warnings if the wall clock is drifting? Having a drifting wall clock is really an operational issue, so I'm not sure if we should try to fix it in the code, as it will probably bring all sorts of other issues too.

And actually I'm not even sure we should warn about it in the SDK, probably we need some system-level monitoring for this instead. Is like if there were a disk failure and we try to warn the user about it and cope with it by reallocating stuff. I think is beyond the SDK's responsibilities.

It really looks like we are trying to patch a broken system here, and I'm not sure if it is a good idea...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I quickly drafted a solution that I think makes everything much easier. I just created a new timer type that is attached to the wall clock especially for the Resampler. We could then polish it and move it to the channels repo if we think it is generally useful. For now it doesn't have all the complexity of the missing ticks policies of the channels timer, it only implements the trigger all missing policy that's needed by the resampler.

Here is the proof of concept PR, against this PR:

The resampler gets much simplied with this, as it doesn't need to care about any time management anymore.

@llucax llucax modified the milestones: v1.0.0-rc4, v1.0.0-rc5 Jan 29, 2024
@llucax llucax modified the milestones: v1.0.0-rc5, v1.0.0-rc6 Feb 19, 2024
@llucax llucax modified the milestones: v1.0.0-rc6, v1.0.0-rc7 Mar 26, 2024
@llucax llucax removed the priority:URGENT Address this immediately or the sky will fall label May 31, 2024
@llucax
Copy link
Contributor

llucax commented Jul 1, 2024

I just checked the status of my proposal PR, without rebasing to the current v1.x.x. It fails some tests, not sure if it also did originally and I didn't check back then or if something changed in the dependencies that makes it fail now, but in any case, it needs more work to get it to a mergeable state. I will give it a shot to try to figure out if they are easy fixes or if it is something that will require a considerable amount of time.

@llucax
Copy link
Contributor

llucax commented Jul 1, 2024

OK, I rebased it and checked the status a bit more and it was not complete (I thought it was, I'm wondering if I have some other branch somewhere with what it's missing and I thought I already had 🤔, but I can't find anything). I think the wall clock timer implementation was complete, but not the integration in the resampler.

So this still requires some work, at least if we are going for my solution. IIRC your solution was still susceptible to some corner cases and this is why I proposed the timer attached to the wall clock instead, but if I'm recalling incorrectly and you solution worked, maybe we can merge that if the bug is more pressing now, and leave the wall clock timer solution for the future.

Otherwise we need to weight again what's the priority for it, if I take on it I think it could take from a full-day (if everything goes well, so pretty slim chances) to a week (on a more realistic estimation considering I can hardly focus on one only topic for very long :P).

EDIT: I might have found the other branch, but it is too late now, I will check it out tomorrow.

@llucax
Copy link
Contributor

llucax commented Jul 3, 2024

OK, no, the other branch was even older. I was looking into this today a little bit more, and it will definitely take some time to clean up. Specially adapting tests (which are quite complicated already) and writing new ones I think it's what will take the most time, as it is a very complicated issue to think about, model and test, with these 2 separate time spaces (the monotonic, real time, and the wall clock, human convention time) that can make look time is either compressing or expanding, and how and when to resync on each case. And then, how this is applied to the resampler in terms of how we build resampling windows 🤯

Just getting to asses this took me a few hours.

@llucax
Copy link
Contributor

llucax commented Jul 8, 2024

Alternative PR (building on top of this one): #999

@matthias-wende-frequenz
Copy link
Contributor Author

closing in favor of #999

@llucax llucax modified the milestones: post-v1.0, Dropped Oct 21, 2024
@llucax llucax added the resolution:duplicate This issue or pull request already exists label Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
part:data-pipeline Affects the data pipeline part:docs Affects the documentation part:tests Affects the unit, integration and performance (benchmarks) tests resolution:duplicate This issue or pull request already exists
Projects
Development

Successfully merging this pull request may close these issues.

3 participants