-
Notifications
You must be signed in to change notification settings - Fork 17
Resync resampler on system time changes #802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resync resampler on system time changes #802
Conversation
We can discuss getting pulling this in and add a test shortly after in a separate PR, to speed up the patch release. What do you think? |
What's the plan here? If the plan is that this is tested in production (or somewhere else), any reason not to just temporarily point the SDK dependency to your fork's branch (https://github.com/matthias-wende-frequenz/frequenz-sdk-python/tree/fix_resampling)? I'd be OK to merge first and do the test afterwards if there is no other way around, but if we can do whatever we need without accepting PRs without tests, I think it makes more sense. |
I was thinking that we can unblock the fix. But I feel it's worth adding the tests before merging. Also the tests are almost ready. Will add them to this PR soon. |
e66b651
to
c84327b
Compare
c84327b
to
c4bb046
Compare
d64c7c2
to
f0ecc06
Compare
Signed-off-by: Matthias Wende <[email protected]>
Adds a boolean parameter if an extra period should be added to the calculated window end or not. Signed-off-by: Matthias Wende <[email protected]>
Signed-off-by: Matthias Wende <[email protected]>
f0ecc06
to
9600ba7
Compare
Signed-off-by: Matthias Wende <[email protected]>
Signed-off-by: Matthias Wende <[email protected]>
Signed-off-by: Matthias Wende <[email protected]>
9600ba7
to
4cc5902
Compare
@@ -457,6 +449,34 @@ def remove_timeseries(self, source: Source) -> bool: | |||
return False | |||
return True | |||
|
|||
def _sync_timer(self, extra_period: bool = True) -> datetime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly, but I would either remove this extra period, as is it only a startup glitch we are trying to fix (I guess it made sense in the original code because it was free), or to switch the default, as the normal situation seems to be not adding an extra period.
# new samples aligned to the new system time have been received. | ||
# Thus we resync `self._window_end` to the new system time in case | ||
# it drifted more then one resampling period away from the system time. | ||
if abs(self._window_end - now) - drift > self._config.resampling_period: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly this is still not enough. Both drifts need to be handled independently, the mixing of both could end up hiding issues. For example:
monotonic clock
0 1 2 3 4 5 6
|---------|---------|---------|---------|---------|---------|---------> time
o o o o o o
t1 t2 t3 t4 t5 expected ticks
T1 T2 T3 T4 observed ticks
O * * * *
|------------|------------|------------|------------|------------|-------> time
0 1 2 3 4 5
wall clock
resampling_period = 1.0, and the wall clock has an accumulating drift of 0.3s per second.
T1:
* now = ~0.9
* window_end = 1.0
* drift = 0.2
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 1.0 - 0.9) - 0.2 > 1.0
0.1 - 0.2 > 1.0
-0.1 > 1.0 => False
T2:
* now = ~1.7
* window_end = 2.0
* drift = 0.1
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 2.0 - 1.7) - 0.1 > 1.0
0.3 - 0.1 > 1.0
0.2 > 1.0 => False
T3:
* now = ~2.4
* window_end = 3.0
* drift = 0
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 3.0 - 2.4) - 0.0 > 1.0
0.6 - 0.0 > 1.0
0.6 > 1.0 => False
T4:
* now = ~4.1
* window_end = 4.0
* drift = 1.3
* abs(self._window_end - now) - drift > self._config.resampling_period
abs( 4.0 - 4.1) - 1.3 > 1.0
0.1 - 1.3 > 1.0
-1.2 > 1.0 => False
I don't know, from this maybe it's enough to use abs(abs(we - now) - abs(drift)) > period
, I have to think about it a bit more. But before doing that, now I really wonder why we are doing all this... 🤔
Don't we just want to always sync to the wall clock, so effectively make the resampler count periods in terms of the wall clock instead of the monotonic clock?
O maybe we just want to emit warnings if the wall clock is drifting? Having a drifting wall clock is really an operational issue, so I'm not sure if we should try to fix it in the code, as it will probably bring all sorts of other issues too.
And actually I'm not even sure we should warn about it in the SDK, probably we need some system-level monitoring for this instead. Is like if there were a disk failure and we try to warn the user about it and cope with it by reallocating stuff. I think is beyond the SDK's responsibilities.
It really looks like we are trying to patch a broken system here, and I'm not sure if it is a good idea...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I quickly drafted a solution that I think makes everything much easier. I just created a new timer type that is attached to the wall clock especially for the Resampler
. We could then polish it and move it to the channels repo if we think it is generally useful. For now it doesn't have all the complexity of the missing ticks policies of the channels timer, it only implements the trigger all missing policy that's needed by the resampler.
Here is the proof of concept PR, against this PR:
The resampler gets much simplied with this, as it doesn't need to care about any time management anymore.
I just checked the status of my proposal PR, without rebasing to the current v1.x.x. It fails some tests, not sure if it also did originally and I didn't check back then or if something changed in the dependencies that makes it fail now, but in any case, it needs more work to get it to a mergeable state. I will give it a shot to try to figure out if they are easy fixes or if it is something that will require a considerable amount of time. |
OK, I rebased it and checked the status a bit more and it was not complete (I thought it was, I'm wondering if I have some other branch somewhere with what it's missing and I thought I already had 🤔, but I can't find anything). I think the wall clock timer implementation was complete, but not the integration in the resampler. So this still requires some work, at least if we are going for my solution. IIRC your solution was still susceptible to some corner cases and this is why I proposed the timer attached to the wall clock instead, but if I'm recalling incorrectly and you solution worked, maybe we can merge that if the bug is more pressing now, and leave the wall clock timer solution for the future. Otherwise we need to weight again what's the priority for it, if I take on it I think it could take from a full-day (if everything goes well, so pretty slim chances) to a week (on a more realistic estimation considering I can hardly focus on one only topic for very long :P). EDIT: I might have found the other branch, but it is too late now, I will check it out tomorrow. |
OK, no, the other branch was even older. I was looking into this today a little bit more, and it will definitely take some time to clean up. Specially adapting tests (which are quite complicated already) and writing new ones I think it's what will take the most time, as it is a very complicated issue to think about, model and test, with these 2 separate time spaces (the monotonic, real time, and the wall clock, human convention time) that can make look time is either compressing or expanding, and how and when to resync on each case. And then, how this is applied to the resampler in terms of how we build resampling windows 🤯 Just getting to asses this took me a few hours. |
Alternative PR (building on top of this one): #999 |
closing in favor of #999 |
The resampler was initializing an internal time on initialization by syncing itself to the system clock.
This sync was never redone and thus lead to out-of-sync issues when the system clock changed, which could cause that the resampling window was empty even when samples, with the new system time, arrived.