Block monitor updates to ensure preimages are in each MPP part #3120
Conversation
Force-pushed ed32962 to 63d7bb7
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff             @@
##             main    #3120      +/-   ##
==========================================
+ Coverage   89.83%   90.54%   +0.71%
==========================================
  Files         121      121
  Lines       98900   106185    +7285
  Branches    98900   106185    +7285
==========================================
+ Hits        88847    96150    +7303
+ Misses       7457     7419      -38
- Partials     2596     2616      +20
Need to take a more thorough look at the last two commits still.
lightning/src/ln/channelmanager.rs (outdated)
	htlc_id: htlc_id.0.unwrap(),
}))
},
// 1 is ClaimedMPPPayment and is handled in the general odd handling below
Should we expand impl_writeable_tlv_based_enum_upgradable with an optional section for variants to ignore, instead of implementing this by hand?
Mmm, that wasn't as hard as I thought.
Force-pushed 63d7bb7 to df19164
Needs rebase + not building for me locally.
Force-pushed df19164 to 30c0b5d
Rebased and fixed the issue.
Force-pushed 30c0b5d to 5ba7723
Sorry for being slow on this. I'm not as familiar with this code. Still need to look at the test.
if let Some((counterparty_node_id, chan_id, htlc_id, claim_ptr)) = pending_mpp_claim {
	let per_peer_state = self.per_peer_state.read().unwrap();
	per_peer_state.get(&counterparty_node_id).map(|peer_state_mutex| {
		let mut peer_state = peer_state_mutex.lock().unwrap();
		let blockers_entry = peer_state.actions_blocking_raa_monitor_updates.entry(chan_id);
		if let btree_map::Entry::Occupied(mut blockers) = blockers_entry {
			blockers.get_mut().retain(|blocker|
				if let &RAAMonitorUpdateBlockingAction::ClaimedMPPPayment { pending_claim } = &blocker {
					if *pending_claim == claim_ptr {
						let mut pending_claim_state_lock = pending_claim.0.lock().unwrap();
						let pending_claim_state = &mut *pending_claim_state_lock;
						pending_claim_state.channels_without_preimage.retain(|(cp, outp, cid, hid)| {
							if *cp == counterparty_node_id && *cid == chan_id && *hid == htlc_id {
								pending_claim_state.channels_with_preimage.push((*cp, *outp, *cid));
								false
							} else { true }
						});
						if pending_claim_state.channels_without_preimage.is_empty() {
							for (cp, outp, cid) in pending_claim_state.channels_with_preimage.iter() {
								freed_channels.push((*cp, *outp, *cid, blocker.clone()));
							}
						}
						!pending_claim_state.channels_without_preimage.is_empty()
					} else { true }
				} else { true }
			);
			if blockers.get().is_empty() {
				blockers.remove();
			}
		}
	});
}
To keep rustfmt from mangling this later, it might be worth turning this into some chained calls. Might flatten it a bit?
Hmm, I don't actually see much room to flatten it :/ Most of the ifs are pattern-matching, not just condition checks, and we have more logic than can be done with a simple filter.
let peer_state_mutex = per_peer_state.entry(counterparty_node_id).or_insert_with(||
	Mutex::new(PeerState {
		channel_by_id: new_hash_map(),
		inbound_channel_request_by_id: new_hash_map(),
		latest_features: InitFeatures::empty(),
		pending_msg_events: Vec::new(),
		in_flight_monitor_updates: BTreeMap::new(),
		monitor_update_blocked_actions: BTreeMap::new(),
		actions_blocking_raa_monitor_updates: BTreeMap::new(),
		is_connected: false,
	}));
let mut peer_state = peer_state_mutex.lock().unwrap();
For my understanding, in what case would we not have a PeerState for a counterparty?
If we're disconnected from the peer and we have no more channels, we may have pruned the PeerState entry.
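For context, a minimal sketch of that pruning rule with hypothetical, stripped-down types (not LDK's actual code): a PeerState entry survives only while the peer is connected or still has channels, so a later claim may need to recreate it via the or_insert_with above.

use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for LDK's PeerState, reduced to the two fields
// the pruning decision described above depends on.
struct PeerState { is_connected: bool, channel_by_id: HashMap<u64, ()> }

// Drop entries for peers that are disconnected and have no channels left.
fn prune_peers(per_peer_state: &mut HashMap<[u8; 33], Mutex<PeerState>>) {
	per_peer_state.retain(|_, peer_state_mutex| {
		let peer_state = peer_state_mutex.lock().unwrap();
		peer_state.is_connected || !peer_state.channel_by_id.is_empty()
	});
}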
Sorry for the delay
Force-pushed 0045ded to de40c83
Rebased and addressed comments.
 // Note that FreeOtherChannelImmediately should never be written - we were supposed to free
 // *immediately*. However, for simplicity we implement read/write here.
 (1, FreeOtherChannelImmediately) => {
 	(0, downstream_counterparty_node_id, required),
 	(2, downstream_funding_outpoint, required),
-	(4, blocking_action, required),
+	(4, blocking_action, upgradable_required),
Can we use unread_variants for this now?
Hmm, maybe? Note that this isn't an encoding change, anyway - it just can sometimes short-circuit and return None for the whole enum.
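A rough sketch of what the call site might look like with the macro opt-out. The variant names come from this PR, but the exact unread_variants placement and separators here are assumptions for illustration, not confirmed syntax:

impl_writeable_tlv_based_enum_upgradable!(RAAMonitorUpdateBlockingAction,
	(0, ForwardedPaymentInboundClaim) => {
		(0, channel_id, required),
		(2, htlc_id, required),
	},
	// Declared unread: written as the contentless odd type 255, with
	// deserialization short-circuiting to `None` for the whole enum.
	unread_variants: ClaimedMPPPayment
);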
lightning/src/ln/channelmanager.rs (outdated)
 let (source_claim_pairs, pending_claim_ptr_opt) = if sources.len() > 1 {
 	let mut pending_claims = PendingMPPClaim {
 		channels_without_preimage: Vec::new(),
 		channels_with_preimage: Vec::new(),
 	};
 	let (source_claim_pairs, channels_without_preimage) = sources.into_iter().filter_map(|htlc| {
 		if let Some(cp_id) = htlc.prev_hop.counterparty_node_id {
 			let htlc_id = htlc.prev_hop.htlc_id;
 			let chan_id = htlc.prev_hop.channel_id;
 			let chan_outpoint = htlc.prev_hop.outpoint;
 			Some((
 				(htlc, Some((cp_id, chan_id, htlc_id))),
 				(cp_id, chan_outpoint, chan_id, htlc_id),
 			))
 		} else {
 			None
 		}
 	}).unzip::<_, _, Vec<_>, _>();
 	pending_claims.channels_without_preimage = channels_without_preimage;
 	(source_claim_pairs, Some(Arc::new(Mutex::new(pending_claims))))
 } else {
 	(sources.into_iter().map(|htlc| (htlc, None)).collect(), None)
 };
 for (htlc, mpp_claim) in source_claim_pairs {
 	let mut pending_mpp_claim = None;
 	let pending_claim_ptr = pending_claim_ptr_opt.as_ref().map(|pending_claim| {
 		pending_mpp_claim = mpp_claim.map(|(cp_id, chan_id, htlc_id)|
 			(cp_id, chan_id, htlc_id, PendingMPPClaimPointer(Arc::clone(pending_claim)))
 		);
 		RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
 			pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),
 		}
 	});
 	self.claim_funds_from_hop(
 		htlc.prev_hop, payment_preimage,
 		|_, definitely_duplicate| {
 			debug_assert!(!definitely_duplicate, "We shouldn't claim duplicatively from a payment");
-			Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash })
+			(Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash, pending_mpp_claim }), pending_claim_ptr)
I find writing it this way a lot less confusing than using the intermediate source_claim_pairs, getting rid of the remaining muts, and doing some renames:
let pending_mpp_claim_opt = if sources.len() > 1 {
	let channels_without_preimage = sources.iter().filter_map(|htlc| {
		htlc.prev_hop.counterparty_node_id.map(|cp_id| {
			(cp_id, htlc.prev_hop.outpoint, htlc.prev_hop.channel_id, htlc.prev_hop.htlc_id)
		})
	}).collect();
	Some(Arc::new(Mutex::new(PendingMPPClaim {
		channels_without_preimage,
		channels_with_preimage: Vec::new()
	})))
} else { None };
for htlc in sources.into_iter() {
	let raa_blocker = pending_mpp_claim_opt.as_ref().map(|pending_claim| {
		RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
			pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),
		}
	});
	let pending_mpp_claim = pending_mpp_claim_opt.as_ref().map(|pending_claim| {
		(htlc.prev_hop.counterparty_node_id.expect("We checked this"), htlc.prev_hop.channel_id,
		 htlc.prev_hop.htlc_id, PendingMPPClaimPointer(Arc::clone(pending_claim)))
	});
	self.claim_funds_from_hop(
		htlc.prev_hop, payment_preimage,
		|_, definitely_duplicate| {
			debug_assert!(!definitely_duplicate, "We shouldn't claim duplicatively from a payment");
			(Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash, pending_mpp_claim }), raa_blocker)
		}
	);
}
Thanks, indeed that is much more readable. I further tweaked the variable names slightly.
LGTM once the outstanding feedback is addressed and Jeff takes another look. Would be good with a squash.
Force-pushed de40c83 to bd1a822
Yeah, feel free to squash. Looks good aside from some minor comments.
blockers.get_mut().retain(|blocker|
	if let &RAAMonitorUpdateBlockingAction::ClaimedMPPPayment { pending_claim } = &blocker {
		if *pending_claim == claim_ptr {
			let mut pending_claim_state_lock = pending_claim.0.lock().unwrap();
Are there any lock order concerns? I guess the nesting defines what is allowable.
Hmm? I'm not sure I see any lockorder concerns here - we always take the new inner lock as the last lock, which is always safe. More generally, our lockorder violation detection is quite good, so as long as we hit a lockorder inversion once in tests, we should fail.
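To illustrate the ordering claim, a minimal self-contained sketch with hypothetical types (not LDK's): the outer map lock is taken first, the per-peer Mutex second, and the new per-claim Mutex strictly last, so the nesting itself rules out an inversion.

use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

struct ClaimState { parts_remaining: usize }
struct Peer { pending_claim: Arc<Mutex<ClaimState>> }

fn mark_part_done(peers: &RwLock<HashMap<u8, Mutex<Peer>>>, peer_id: u8) {
	let map = peers.read().unwrap(); // lock 1: the outer per-peer map
	if let Some(peer_mutex) = map.get(&peer_id) {
		let peer = peer_mutex.lock().unwrap(); // lock 2: this peer's state
		// lock 3: the claim state, always innermost and always taken last,
		// so no code path can acquire these three in a conflicting order.
		let mut claim = peer.pending_claim.lock().unwrap();
		claim.parts_remaining = claim.parts_remaining.saturating_sub(1);
	}
}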
@@ -6324,7 +6361,9 @@ where
 		}
 	}

-	fn claim_funds_from_hop<ComplFunc: FnOnce(Option<u64>, bool) -> Option<MonitorUpdateCompletionAction>>(
+	fn claim_funds_from_hop<
+		ComplFunc: FnOnce(Option<u64>, bool) -> (Option<MonitorUpdateCompletionAction>, Option<RAAMonitorUpdateBlockingAction>)
Looks like (None, Some) is not possible. Should we consider a different nesting or possibly a three-variant enum?
Hmm, good catch - not sure it matters, though? It complicates the code in the on-chain-claim branch a tiny bit, but more importantly they're logically distinct concepts, and there's nothing wrong with there being an RAA blocker without an event; we just don't have a use for it atm.
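For concreteness, the three-variant shape floated above might look something like this. It is purely hypothetical (the PR keeps the tuple), and it is written generically so the sketch stands alone; in context A would be MonitorUpdateCompletionAction and B would be RAAMonitorUpdateBlockingAction.

// Hypothetical alternative return type for the completion closure, making
// the unused (None, Some) combination unrepresentable.
enum ClaimCompletion<A, B> {
	Nothing,                 // no completion action, no blocker
	Action(A),               // an action with no blocker (e.g. the on-chain-claim branch)
	ActionWithBlocker(A, B), // an action plus an RAA blocker for an MPP claim
}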
In some cases, we have variants of an enum serialized using `impl_writeable_tlv_based_enum_upgradable` which we don't want to write/read. Here we add support for such variants by writing them using the (odd) type 255 without any contents and using `MaybeReadable` to decline to read them.
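A minimal sketch of that write/read scheme with stand-in types. The trait mirrors the shape of LDK's MaybeReadable, but the enum, type bytes beyond 255, and error handling here are simplified for illustration: the ignored variant is written as the contentless odd type 255, and reading it yields Ok(None) rather than an error.

use std::io;

enum Action { Useful(u64), IgnoredOnRead }

// Simplified stand-in for LDK's MaybeReadable: Ok(None) means "understood,
// but declined to read", so the value simply disappears on deserialization.
trait MaybeReadableSketch: Sized {
	fn read<R: io::Read>(r: &mut R) -> Result<Option<Self>, io::Error>;
}

fn write_action<W: io::Write>(a: &Action, w: &mut W) -> io::Result<()> {
	match a {
		Action::Useful(v) => { w.write_all(&[0])?; w.write_all(&v.to_be_bytes()) },
		// Never meant to be read back: the odd type 255 with no contents.
		// Odd types are optional-to-understand, so readers can skip them.
		Action::IgnoredOnRead => w.write_all(&[255]),
	}
}

impl MaybeReadableSketch for Action {
	fn read<R: io::Read>(r: &mut R) -> Result<Option<Self>, io::Error> {
		let mut ty = [0u8; 1];
		r.read_exact(&mut ty)?;
		match ty[0] {
			0 => {
				let mut v = [0u8; 8];
				r.read_exact(&mut v)?;
				Ok(Some(Action::Useful(u64::from_be_bytes(v))))
			},
			255 => Ok(None), // decline to read the ignored variant
			t if t % 2 == 1 => Ok(None), // unknown odd variants are skippable
			_ => Err(io::Error::new(io::ErrorKind::InvalidData, "unknown even variant")),
		}
	}
}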
Because we track pending `ChannelMonitorUpdate`s per-peer, we really need to know which peer an HTLC came from when we go to claim it on-chain, allowing us to look up any blocked actions in the `PeerState`. While we could do this by moving the blocked actions to some global "closed-channel update queue", it's simpler to do it this way. While this trades off `ChannelMonitorUpdate` size somewhat (the `HTLCSource` is included in many of them), which we should be sensitive to, it will also allow us to (eventually) remove the SCID -> peer + channel_id lookups we do when claiming or failing today, which are somewhat annoying to deal with.
If we claim an MPP payment and only persist some of the `ChannelMonitorUpdate`s which include the preimage prior to shutting down, we may be in a state where some of our `ChannelMonitor`s have the preimage for a payment while others do not.

This, it turns out, is actually mostly safe - on startup `ChannelManager` will detect there's a payment it has as unclaimed but there's a corresponding payment preimage in a `ChannelMonitor` and go claim the other MPP parts. This works so long as the `ChannelManager` has been persisted after the payment has been received but prior to the `PaymentClaimable` event being processed (and the claim itself occurring). This is not always true, and certainly not required by our API, but our `lightning-background-processor` today does persist prior to event handling, so it is generally true subject to some race conditions.

In order to address this race we need to copy payment preimages across channels irrespective of the `ChannelManager`'s payment state, but this introduces another wrinkle - if one channel makes substantial progress while other channel(s) are still waiting to get the payment preimage into their `ChannelMonitor`(s) while the `ChannelManager` hasn't been persisted after the payment was received, we may end up without the preimage on disk.

Here, we address this issue with a new `RAAMonitorUpdateBlockingAction` variant for this specific case. We block persistence of an RAA `ChannelMonitorUpdate` which may remove the preimage from disk until all channels have had the preimage added to their `ChannelMonitor`. We do this only in-memory (and not on disk) as we can recreate this blocker during the startup re-claim logic.

This will enable us to claim MPP parts without using the `ChannelManager`'s payment state in later work.
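The bookkeeping this new variant drives can be sketched as follows, with hypothetical simplified types (the real code tracks (counterparty, outpoint, channel_id, htlc_id) tuples inside the retain loop shown earlier in this thread):

// Stripped-down model of PendingMPPClaim's two lists: a part moves from
// `channels_without_preimage` to `channels_with_preimage` once the
// ChannelMonitorUpdate carrying the preimage completes, and the RAA
// blockers on all involved channels are released only when the first
// list drains to empty.
struct PendingMppClaimSketch {
	channels_without_preimage: Vec<u64>, // channel ids still awaiting the preimage
	channels_with_preimage: Vec<u64>,    // channel ids whose monitors have it
}

impl PendingMppClaimSketch {
	/// Records that `chan_id`'s monitor now holds the preimage. Returns true
	/// once every part has it, i.e. when the RAA blockers can be freed.
	fn preimage_persisted(&mut self, chan_id: u64) -> bool {
		if let Some(pos) = self.channels_without_preimage.iter().position(|c| *c == chan_id) {
			let c = self.channels_without_preimage.remove(pos);
			self.channels_with_preimage.push(c);
		}
		self.channels_without_preimage.is_empty()
	}
}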
Force-pushed bd1a822 to 2d1306a
Squashed with a trivial cleanup:

$ git diff-tree -U3 bd1a8222 2d1306a3
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index f9bf2b963..67159ef52 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -6321,14 +6321,14 @@ where
 			None
 		};
 		for htlc in sources {
-			let this_mpp_claim = pending_mpp_claim_ptr_opt.as_ref().map(|pending_mpp_claim|
+			let this_mpp_claim = pending_mpp_claim_ptr_opt.as_ref().and_then(|pending_mpp_claim|
 				if let Some(cp_id) = htlc.prev_hop.counterparty_node_id {
 					let claim_ptr = PendingMPPClaimPointer(Arc::clone(pending_mpp_claim));
 					Some((cp_id, htlc.prev_hop.channel_id, htlc.prev_hop.htlc_id, claim_ptr))
 				} else {
 					None
 				}
-			).flatten();
+			);
 			let raa_blocker = pending_mpp_claim_ptr_opt.as_ref().map(|pending_claim| {
 				RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
 					pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),