Disallow dual-sync-async persistence without restarting #3737


Open

wants to merge 5 commits into main from 2025-04-no-dual-sync-async

Conversation

TheBlueMatt
Collaborator

In general, we don't expect users to persist
`ChannelMonitor[Update]`s both synchronously and asynchronously for
a single `ChannelManager` instance. If a user has implemented
asynchronous persistence, they should generally always use that,
as there is then no advantage to them to occasionally persist
synchronously.

Even still, in 920d96edb6595289902f287419de2d002e2dc2ee we fixed
some bugs related to such operation, and noted that "there isn't
much cost to supporting it". Sadly, this is not true.

Specifically, the dual-sync-async persistence flow is ill-defined
and difficult to define in a way that a user can realistically
implement.

Consider the case of a `ChannelMonitorUpdate` which is persisted
asynchronously and while it is still being persisted a new
`ChannelMonitorUpdate` is created. If the second
`ChannelMonitorUpdate` is persisted synchronously, the
`ChannelManager` will be left with a single pending
`ChannelMonitorUpdate` which is not the latest.

If we were to then restart, the latest copy of the `ChannelMonitor`
would be that without any updates, but the `ChannelManager` has a
pending `ChannelMonitorUpdate` for the next update, but not the one
after that. The user would then have to handle the replayed
`ChannelMonitorUpdate` and then find the second
`ChannelMonitorUpdate` on disk and somehow know to replay that one
as well.

Further, we currently have a bug in handling this scenario as we'll
complete all pending post-update actions when the second
`ChannelMonitorUpdate` gets persisted synchronously, even though
the first `ChannelMonitorUpdate` is still pending. While we could
rather trivially fix these issues, addressing the larger API
question above is difficult and as we don't anticipate this
use-case being important, we just disable it here.

Note that we continue to support it internally as some 39 tests
rely on it.

Issue highlighted by (changes to the) chanmon_consistency fuzz
target (in the next commit).

@TheBlueMatt TheBlueMatt added this to the 0.2 milestone Apr 15, 2025
@ldk-reviews-bot

ldk-reviews-bot commented Apr 15, 2025

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@valentinewallace
Contributor

We may be able to remove the test test_sync_async_persist_doesnt_hang since it specifically covers async + sync persistence.

@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from 4e41cb7 to 70f9409 on April 16, 2025
@valentinewallace
Contributor

Would it be a good take-a-Friday issue to fix the legacy tests that use both sync + async so we don't need to gate the panics anymore?

Comment on lines +2572 to +2573
/// We only support using one of [`ChannelMonitorUpdateStatus::InProgress`] and
/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not
Contributor

Some thoughts: is there a way to enforce this with the compiler? And is it a goal of the project to continue supporting both sync + async persistence forever?

Collaborator Author

I suppose we could have some kind of flag you set on startup which indicates which persistence method you're using, but I'm not really sure that it would be an issue in practice - we expect users to use either Persist or some future AsyncPersist which will handle this stuff for them, and I don't really see why someone would ever persist sometimes-sync-sometimes-async - either your disk is slow and you need async or you don't.

We definitely want to support both at the project level, though, for the same reason we want to support async and sync everywhere in the project - Rust-async-native projects are becoming more common, but Rust-non-async projects also exist (as well as platforms where that's required), and our current bindings are not async.

Contributor

We definitely want to support both at the project level, though, for the same reason we want to support async and sync everywhere in the project - Rust-async-native projects are becoming more common, but Rust-non-async projects also exist (as well as platforms where that's required), and our current bindings are not async.

To clarify, by sync + async persistence I mean supporting either returning ChannelMonitorUpdateStatus::Completed or ::InProgress. It feels like we expect most production users to persist data asynchronously and then later call channel_monitor_updated, so it could be reasonable to drop support for the ::Completed option down the line. Seems like users that are the exception to that can just call channel_monitor_updated immediately, as well.

Collaborator Author

Maybe? I imagine many small nodes will still want the simplicity of sync.

Contributor

Maybe, seems worth exploring though. Bookmarking to discuss further.

Collaborator Author

Yea, maybe when we're super 100% on the stability of async persist? It wouldn't free up a lot of code though, because it's all the same pathway now anyway.

Contributor

A luxury API for a minority group of sync users may not be the best choice long term?

But if sync users would need to use async with an immediate callback now, I think it is no longer so clear that async is not production ready?

Hopefully we can get to super 100% soon.

Collaborator Author

But if sync users would need to use async with an immediate callback now, I think it is no longer so clear that async is not production ready?

Hmm? This PR is not requiring that.

Comment on lines 968 to 970
let empty_node_a_ser = node_a.encode();
let empty_node_b_ser = node_b.encode();
let empty_node_c_ser = node_c.encode();
Contributor

These don't seem to be used anywhere?

Collaborator Author

Oops, part of a later patch :)

@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from 70f9409 to ddccbda on April 16, 2025

codecov bot commented Apr 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.26%. Comparing base (83e9e80) to head (cb6219d).
Report is 60 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3737      +/-   ##
==========================================
+ Coverage   89.12%   90.26%   +1.13%     
==========================================
  Files         156      158       +2     
  Lines      123514   134266   +10752     
  Branches   123514   134266   +10752     
==========================================
+ Hits       110086   121195   +11109     
+ Misses      10749    10536     -213     
+ Partials     2679     2535     -144     


@TheBlueMatt
Collaborator Author

Would it be a good take-a-Friday issue to fix the legacy tests that use both sync + async so we don't need to gate the panics anymore?

Yea, I imagine it's a good bit of work, though.

@wpaulino
Contributor

Needs a squash

@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from ddccbda to d009b39 on April 29, 2025
@TheBlueMatt
Collaborator Author

Squashed with a minor comment fix:

$ git diff-tree -U1 ddccbdabd d009b3918
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index c86eed77b..54b607d88 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2573,4 +2573,4 @@ where
 	/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not
-	/// otherwise directly enforce this, we enforce it in debug builds here by storing which one is
-	/// in use.
+	/// otherwise directly enforce this, we enforce it in non-test builds here by storing which one
+	/// is in use.
 	#[cfg(not(test))]

wpaulino previously approved these changes Apr 29, 2025
@wpaulino
Contributor

These externalized tests seem to be failing:

- test_batch_funding_close_after_funding_signed
- test_batch_channel_open

In general, we don't expect users to persist
`ChannelMonitor[Update]`s both synchronously and asynchronously for
a single `ChannelManager` instance. If a user has implemented
asynchronous persistence, they should generally always use that,
as there is then no advantage to them to occasionally persist
synchronously.

Even still, in 920d96e we fixed
some bugs related to such operation, and noted that "there isn't
much cost to supporting it". Sadly, this is not true.

Specifically, the dual-sync-async persistence flow is ill-defined
and difficult to define in a way that a user can realistically
implement.

Consider the case of a `ChannelMonitorUpdate` which is persisted
asynchronously and while it is still being persisted a new
`ChannelMonitorUpdate` is created. If the second
`ChannelMonitorUpdate` is persisted synchronously, the
`ChannelManager` will be left with a single pending
`ChannelMonitorUpdate` which is not the latest.

If we were to then restart, the latest copy of the `ChannelMonitor`
would be that without any updates, but the `ChannelManager` has a
pending `ChannelMonitorUpdate` for the next update, but not the one
after that. The user would then have to handle the replayed
`ChannelMonitorUpdate` and then find the second
`ChannelMonitorUpdate` on disk and somehow know to replay that one
as well.

Further, we currently have a bug in handling this scenario as we'll
complete all pending post-update actions when the second
`ChannelMonitorUpdate` gets persisted synchronously, even though
the first `ChannelMonitorUpdate` is still pending. While we could
rather trivially fix these issues, addressing the larger API
question above is difficult and as we don't anticipate this
use-case being important, we just disable it here.

Note that we continue to support it internally as some 39 tests
rely on it. We do, however, remove
test_sync_async_persist_doesnt_hang, which was added specifically
to test this now-unsupported use-case.

Issue highlighted by (changes to the) chanmon_consistency fuzz
target (in the next commit).
@TheBlueMatt TheBlueMatt dismissed stale reviews from wpaulino and valentinewallace via cb6219d April 30, 2025 01:21
@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from d009b39 to cb6219d on April 30, 2025
@TheBlueMatt
Collaborator Author

Oof, ugh, right. Squashed-pushed a fix:

$ git diff-tree -U1 d009b3918 cb6219df6
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index 54b607d88..34751bfc9 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2575,3 +2575,3 @@ where
 	/// is in use.
-	#[cfg(not(test))]
+	#[cfg(not(any(test, feature = "_externalize_tests")))]
 	monitor_update_type: AtomicUsize,
@@ -3321,3 +3321,3 @@ macro_rules! handle_new_monitor_update {
 			ChannelMonitorUpdateStatus::InProgress => {
-				#[cfg(not(test))]
+				#[cfg(not(any(test, feature = "_externalize_tests")))]
 				if $self.monitor_update_type.swap(1, Ordering::Relaxed) == 2 {
@@ -3330,3 +3330,3 @@ macro_rules! handle_new_monitor_update {
 			ChannelMonitorUpdateStatus::Completed => {
-				#[cfg(not(test))]
+				#[cfg(not(any(test, feature = "_externalize_tests")))]
 				if $self.monitor_update_type.swap(2, Ordering::Relaxed) == 1 {
@@ -3594,3 +3594,3 @@ where

-			#[cfg(not(test))]
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
 			monitor_update_type: AtomicUsize::new(0),
@@ -14767,3 +14767,3 @@ where

-			#[cfg(not(test))]
+			#[cfg(not(any(test, feature = "_externalize_tests")))]
 			monitor_update_type: AtomicUsize::new(0),

@joostjager (Contributor) left a comment

Wondering whether the problem where the chan mgr will be left with a single pending update which is not the latest can also occur when purely async persistence is used, but the channel_monitor_updated calls are made in the reverse order?

@TheBlueMatt
Collaborator Author

Wondering whether the problem where the chan mgr will be left with a single pending update which is not the latest can also occur when purely async persistence is used, but the channel_monitor_updated calls are made in the reverse order?

MonitorEvent::Completed is only allowed to signal "all updates to and through the given id have completed" to avoid this case. ChainMonitor tracks this and generates the right MonitorEvents.

wpaulino previously approved these changes Apr 30, 2025
@valentinewallace
Contributor

LGTM after squash

@TheBlueMatt
Collaborator Author

Re-dropped the constant after discussion.

@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from 3805ca1 to 90aec07 on May 2, 2025
@joostjager (Contributor) left a comment

I just posted that drive-by comment, but was happy to review the full PR as requested, as it is context that I probably need for my own future work too.

Just see what you want to do with the comments, as this PR was already in the later stages with two reviewers on it already.

for (channel_id, monitor) in monitors.iter() {
monitor_refs.insert(*channel_id, monitor);
}
let reload_node = |ser: &Vec<u8>,
Contributor

Do this as the first commit to reduce total delta? I think it's better for review when refactors come before functional changes in a PR.

Collaborator Author

Is it really worth changing? Indeed, it's generally nicer, but it's three extra lines of diff and I wrote it the other way originally; not sure it's worth bothering to change.

Contributor

No, not blocking.

@@ -680,8 +679,8 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
let mon_style = [default_mon_style.clone(), default_mon_style.clone(), default_mon_style];

macro_rules! reload_node {
($ser: expr, $node_id: expr, $old_monitors: expr, $keys_manager: expr, $fee_estimator: expr) => {{
let keys_manager = Arc::clone(&$keys_manager);
($ser: expr, $node_id: expr, $old_monitors: expr, $use_old_mons: expr, $keys: expr, $fee_estimator: expr) => {{
Contributor

use_old_mons could be an enum for readability, and then 0x2a | 0x2b | 0x2c etc could match directly to enum values. Makes it all a bit more explicit, with less explanation about +1/+2 required.

Collaborator Author

I went ahead and broke backwards compat so I don't think this is worth it anymore (it's always just passing v).

@@ -1514,12 +1524,16 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
ab_events.clear();
ba_events.clear();
}
// Note that we ensure 0x2c represents "use oldest monitor" to retain backwards
// compatibility with existing fuzz corpuses by setting (v + 1) % 3 == 0
Contributor

Good one to keep in mind existing corpuses.

($ser: expr, $node_id: expr, $old_monitors: expr, $keys_manager: expr, $fee_estimator: expr) => {{
let keys_manager = Arc::clone(&$keys_manager);
($ser: expr, $node_id: expr, $old_monitors: expr, $use_old_mons: expr, $keys: expr, $fee_estimator: expr) => {{
let keys_manager = Arc::clone(&$keys);
Contributor

I think the arc/clone/keys thing is an unrelated refactor?

Collaborator Author

Hmm? Those already existed (the only change is that the fee_estimator clone was changed to Arc::clone rather than the implicit clone for consistency, but it's calling the same method).

@@ -674,6 +676,9 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
}};
}

let default_mon_style = RefCell::new(ChannelMonitorUpdateStatus::Completed);
Contributor

Are the changes to this file a preparation for the next commit somehow and/or are there also independent refactors in here? I struggle a bit to see the link with the chanmgr change.

Collaborator Author

It's because we're no longer allowed to switch monitor connection styles mid-stream. So instead, we're changing to storing the monitor style we'll start applying on the next node reload, and applying it only on reload.

break;
if last_pass_no_updates {
// In some cases, we may generate a message to send in
// `process_msg_events`, but block sending until
Contributor

Is this pre-existing?

Collaborator Author

No, because in the past we would switch to sync persist before we get here, but now we have to keep it async.


When we reload a node in the `chanmon_consistency` fuzzer, we
always reload with the latest `ChannelMonitor` state which was
confirmed as persisted to the running `ChannelManager`. This is
nice in that it tests losing the latest `ChannelMonitor`, but there
may also be bugs in the on-startup `ChannelMonitor` replay.

Thus, here, we optionally reload with a newer `ChannelMonitor` than
the last-persisted one.

Note that this breaks backwards compat for existing
`chanmon_consistency` tests, requiring that 0x2c bytes be replaced
with 0xb1, 0x2d with 0xb4, and 0x2e with 0xbd.

Since we broke backwards compat and changed the values of bytes
for reloading nodes in the `chanmon_consistency` fuzzer, here we
move them down to the right position.

`chanmon_consistency` was originally written with lots of macros
due to some misguided concept of code being unrolled at
compile-time. This is, of course, a terrible idea not just for
compile times but also for performance.

Here, we make `reload_node` a function in anticipation of it being
used in additional places in future work.
@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from 90aec07 to d27c161 on May 2, 2025
@valentinewallace (Contributor) left a comment

Good with a squash

} else if use_old_mons % 3 == 1 {
// Reload with the second-oldest `ChannelMonitor`
let old_mon = prev_state.persisted_monitor;
prev_state.pending_monitors.drain(..).next().map(|(_, v)| v).unwrap_or(old_mon)
Contributor

Could you comment why we're emptying pending_monitors rather than just removing the first one? We don't clear the vec below, so it reads a bit inconsistent to me.

Comment on lines +1698 to +1699
// Note that we ensure 0x2d represents "use oldest monitor" to retain backwards
// compatibility with existing fuzz corpuses by setting v % 3 == 0
Contributor

This is outdated now
