-
Notifications
You must be signed in to change notification settings - Fork 402
Disallow dual-sync-async persistence without restarting #3737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Disallow dual-sync-async persistence without restarting #3737
Conversation
TheBlueMatt
commented
Apr 15, 2025
👋 Thanks for assigning @valentinewallace as a reviewer! |
We may be able to remove the test |
4e41cb7
to
70f9409
Compare
Would it be a good take-a-Friday issue to fix the legacy tests that use both sync + async so we don't need to gate the panics anymore? |
/// We only support using one of [`ChannelMonitorUpdateStatus::InProgress`] and | ||
/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts: is there a way to enforce this with the compiler? And is it a goal of the project to continue supporting both sync + async persistence forever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we could have some kind of flag you set on startup which indicates which persistence method you're using, but I'm sure really sure that it would be an issue in practice - we expect users to use either Persist
or some future AsyncPersist
which will handle this stuff for them, and I don't really see why someone would ever persist sometimes-sync-sometimes-async - either your disk is slow and you need async or you don't.
We definitely want to support both at the project level, though, for the same reason we want to support async and sync everywhere in the project - Rust-async-native projects are becoming more common, but Rust-non-async projects also exist (as well as platforms where that's required) as well as our current bindings are not async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely want to support both at the project level, though, for the same reason we want to support async and sync everywhere in the project - Rust-async-native projects are becoming more common, but Rust-non-async projects also exist (as well as platforms where that's required) as well as our current bindings are not async.
To clarify, by sync + async persistence I mean supporting either returning ChannelMonitorUpdateStatus::Completed
or ::InProgress
. It feels like we expect most production users to persist data asynchronously and then later call channel_monitor_updated
, so it could be reasonable to drop support for the ::Completed
option down the line. Seems like users that are the exception to that can just call channel_monitor_updated
immediately, as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? I imagine many small nodes will still want the simplicity of sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, seems worth exploring though. Bookmarking to discuss further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, maybe when we're super 100% on the stability of async persist? It wouldn't free up a lot of code though cause it's all the same pathway now anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A luxury API for a minority group of sync users may not be the best choice long term?
But if sync users would need to use async with an immediate callback now, I think it is no longer so clear that async is not production ready?
Hopefully we can get to super 100% soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if sync users would need to use async with an immediate callback now, I think it is no longer so clear that async is not production ready?
Hmm? This PR is not requiring that.
fuzz/src/chanmon_consistency.rs
Outdated
let empty_node_a_ser = node_a.encode(); | ||
let empty_node_b_ser = node_b.encode(); | ||
let empty_node_c_ser = node_c.encode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These don't seem to be used anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, part of a later patch :)
70f9409
to
ddccbda
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #3737 +/- ##
==========================================
+ Coverage 89.12% 90.26% +1.13%
==========================================
Files 156 158 +2
Lines 123514 134266 +10752
Branches 123514 134266 +10752
==========================================
+ Hits 110086 121195 +11109
+ Misses 10749 10536 -213
+ Partials 2679 2535 -144 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Yea, I imagine its a good bit of work, though. |
Needs a squash |
ddccbda
to
d009b39
Compare
Squashed with a minor comment fix: $ git diff-tree -U1 ddccbdabd d009b3918
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index c86eed77b..54b607d88 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2573,4 +2573,4 @@ where
/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not
- /// otherwise directly enforce this, we enforce it in debug builds here by storing which one is
- /// in use.
+ /// otherwise directly enforce this, we enforce it in non-test builds here by storing which one
+ /// is in use.
#[cfg(not(test))] |
These externalized tests seem to be failing:
|
In general, we don't expect users to persist `ChannelMonitor[Update]`s both synchronously and asynchronously for a single `ChannelManager` instance. If a user has implemented asynchronous persistence, they should generally always use that, as there is then no advantage to them to occasionally persist synchronously. Even still, in 920d96e we fixed some bugs related to such operation, and noted that "there isn't much cost to supporting it". Sadly, this is not true. Specifically, the dual-sync-async persistence flow is ill-defined and difficult to define in away that a user can realistically implement. Consider the case of a `ChannelMonitorUpdate` which is persisted asynchronously and while it is still being persisted a new `ChannelMonitorUpdate` is created. If the second `ChannelMonitorUpdate` is persisted synchronously, the `ChannelManager` will be left with a single pending `ChannelMonitorUpdate` which is not the latest. If we were to then restart, the latest copy of the `ChannelMonitor` would be that without any updates, but the `ChannelManager` has a pending `ChannelMonitorUpdate` for the next update, but not the one after that. The user would then have to handle the replayed `ChannelMonitorUpdate` and then find the second `ChannelMonitorUpdate` on disk and somehow know to replay that one as well. Further, we currently have a bug in handling this scenario as we'll complete all pending post-update actions when the second `ChannelMonitorUpdate` gets persisted synchronously, even though the first `ChannelMonitorUpdate` is still pending. While we could rather trivially fix these issues, addressing the larger API question above is difficult and as we don't anticipate this use-case being important, we just disable it here. Note that we continue to support it internally as some 39 tests rely on it. We do, however, remove test_sync_async_persist_doesnt_hang which was added specifically to test for this use-case, and we now do not support it. Issue highlighted by (changes to the) chanmon_consistency fuzz target (in the next commit).
cb6219d
d009b39
to
cb6219d
Compare
Oof, ugh, right. Squashed-pushed a fix: $ git diff-tree -U1 d009b3918 cb6219df6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 54b607d88..34751bfc9 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2575,3 +2575,3 @@ where
/// is in use.
- #[cfg(not(test))]
+ #[cfg(not(any(test, feature = "_externalize_tests")))]
monitor_update_type: AtomicUsize,
@@ -3321,3 +3321,3 @@ macro_rules! handle_new_monitor_update {
ChannelMonitorUpdateStatus::InProgress => {
- #[cfg(not(test))]
+ #[cfg(not(any(test, feature = "_externalize_tests")))]
if $self.monitor_update_type.swap(1, Ordering::Relaxed) == 2 {
@@ -3330,3 +3330,3 @@ macro_rules! handle_new_monitor_update {
ChannelMonitorUpdateStatus::Completed => {
- #[cfg(not(test))]
+ #[cfg(not(any(test, feature = "_externalize_tests")))]
if $self.monitor_update_type.swap(2, Ordering::Relaxed) == 1 {
@@ -3594,3 +3594,3 @@ where
- #[cfg(not(test))]
+ #[cfg(not(any(test, feature = "_externalize_tests")))]
monitor_update_type: AtomicUsize::new(0),
@@ -14767,3 +14767,3 @@ where
- #[cfg(not(test))]
+ #[cfg(not(any(test, feature = "_externalize_tests")))]
monitor_update_type: AtomicUsize::new(0), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering whether the problem where chan mgr will be left with a single pending update which is not the latest can also occur when purely async is used, but channel_monitor_updated
are called in the reverse order?
|
cb6219d
to
3805ca1
Compare
LGTM after squash |
Re-dropped the constant after discussion. |
3805ca1
to
90aec07
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just posted that drive-by comment, but was happy to review the full PR as requested as it is context that I probably need for my own future work too.
Just see what you want to do with the comments, as this PR was already in the later stages with two reviewers on it already.
for (channel_id, monitor) in monitors.iter() { | ||
monitor_refs.insert(*channel_id, monitor); | ||
} | ||
let reload_node = |ser: &Vec<u8>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do this as the first commit to reduce total delta? I think it's better for review when refactors come before functional changes in a PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really worth changing? Indeed, its generally nicer, but its three extra lines of diff and I wrote it the other way originally, not sure its worth bothering to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, not blocking.
fuzz/src/chanmon_consistency.rs
Outdated
@@ -680,8 +679,8 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) { | |||
let mon_style = [default_mon_style.clone(), default_mon_style.clone(), default_mon_style]; | |||
|
|||
macro_rules! reload_node { | |||
($ser: expr, $node_id: expr, $old_monitors: expr, $keys_manager: expr, $fee_estimator: expr) => {{ | |||
let keys_manager = Arc::clone(&$keys_manager); | |||
($ser: expr, $node_id: expr, $old_monitors: expr, $use_old_mons: expr, $keys: expr, $fee_estimator: expr) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_old_mons could be an enum for readability and then match 0x2a | 0x2b | 0x2c etc directly to enum values. Makes it all a bit more explicit with fewer explanation about +1 +2 required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and broke backwards compat so I don't think this is worth it anymore (its always just passing v
).
fuzz/src/chanmon_consistency.rs
Outdated
@@ -1514,12 +1524,16 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) { | |||
ab_events.clear(); | |||
ba_events.clear(); | |||
} | |||
// Note that we ensure 0x2c represents "use oldest monitor" to retain backwards | |||
// compatibility with existing fuzz corpuses by using setting (v + 1) % 3 == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good one to keep in mind existing corpuses.
fuzz/src/chanmon_consistency.rs
Outdated
($ser: expr, $node_id: expr, $old_monitors: expr, $keys_manager: expr, $fee_estimator: expr) => {{ | ||
let keys_manager = Arc::clone(&$keys_manager); | ||
($ser: expr, $node_id: expr, $old_monitors: expr, $use_old_mons: expr, $keys: expr, $fee_estimator: expr) => {{ | ||
let keys_manager = Arc::clone(&$keys); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the arc/clone/keys thing is an unrelated refactor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm? Those already existed (the only change is that the fee_estimator
clone was changed to Arc::clone
rather than the implicit clone
for consistency, but its calling the same method).
@@ -674,6 +676,9 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) { | |||
}}; | |||
} | |||
|
|||
let default_mon_style = RefCell::new(ChannelMonitorUpdateStatus::Completed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the changes to this file a preparation for the next commit somehow and/or are there also independent refactors in here? I struggle a bit to see the link with the chanmgr change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its because we're no longer allowed to switch monitor connection styles mid-stream. So instead we're changing to storing the monitor style we'll start applying on the next node reload here, and applying it only on reload.
break; | ||
if last_pass_no_updates { | ||
// In some cases, we may generate a message to send in | ||
// `process_msg_events`, but block sending until |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this pre-existing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because in the past we would switch to sync persist before we get here, but now we have to keep it async.
/// We only support using one of [`ChannelMonitorUpdateStatus::InProgress`] and | ||
/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A luxury API for a minority group of sync users may not be the best choice long term?
But if sync users would need to use async with an immediate callback now, I think it is no longer so clear that async is not production ready?
Hopefully we can get to super 100% soon.
When we reload a node in the `chanmon_consistency` fuzzer, we always reload with the latest `ChannelMonitor` state which was confirmed as persisted to the running `ChannelManager`. This is nice in that it tests losing the latest `ChannelMonitor`, but there may also be bugs in the on-startup `ChannelMonitor` replay. Thus, here, we optionally reload with a newer `ChannelMonitor` than the last-persisted one. Note that this breaks backwards compat for existing `chanmon_consistency` tests, requiring that 0x2c bytes be replaced with 0xb1, 0x2d with 0xb4 and 0x2e with 0xbd.
Since we broke backwards compat and changed the values of bytes for reloading nodes in the `chanmon_consistency` fuzzer, here we move them down to the right position.
`chanmon_consistency` was originally written with lots of macros due to some misguided concept of code being unrolled at compile-time. This is, of course, a terrible idea not just for compile times but also for performance. Here, we make `reload_node` a function in anticipation of it being used in additional places in future work.
90aec07
to
d27c161
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good with a squash
} else if use_old_mons % 3 == 1 { | ||
// Reload with the second-oldest `ChannelMonitor` | ||
let old_mon = prev_state.persisted_monitor; | ||
prev_state.pending_monitors.drain(..).next().map(|(_, v)| v).unwrap_or(old_mon) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you comment why we're emptying pending_monitors
rather than just removing the first one? We don't clear the vec below so it reads a bit inconsistent to me
// Note that we ensure 0x2d represents "use oldest monitor" to retain backwards | ||
// compatibility with existing fuzz corpuses by using setting v % 3 == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outdated now