Block monitor updates to ensure preimages are in each MPP part #3120
Conversation
Force-pushed ed32962 to 63d7bb7
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff             @@
##             main    #3120      +/-   ##
==========================================
+ Coverage   89.83%   90.54%   +0.71%
==========================================
  Files         121      121
  Lines       98900   106185    +7285
  Branches    98900   106185    +7285
==========================================
+ Hits        88847    96150    +7303
+ Misses       7457     7419      -38
- Partials     2596     2616      +20
Need to take a more thorough look at the last two commits still.
lightning/src/ln/channelmanager.rs (outdated)
	htlc_id: htlc_id.0.unwrap(),
}))
},
// 1 is ClaimedMPPPayment and is handled in the general odd handling below
Should we expand impl_writeable_tlv_based_enum_upgradable with an optional section for variants to ignore, instead of implementing this by hand?
Mmm, that wasn't as hard as I thought.
Force-pushed 63d7bb7 to df19164
Needs rebase + not building for me locally.
Force-pushed df19164 to 30c0b5d
Rebased and fixed the issue.
Force-pushed 30c0b5d to 5ba7723
Sorry for being slow on this. I'm not as familiar with this code. Still need to look at the test.
if let Some((counterparty_node_id, chan_id, htlc_id, claim_ptr)) = pending_mpp_claim {
	let per_peer_state = self.per_peer_state.read().unwrap();
	per_peer_state.get(&counterparty_node_id).map(|peer_state_mutex| {
		let mut peer_state = peer_state_mutex.lock().unwrap();
		let blockers_entry = peer_state.actions_blocking_raa_monitor_updates.entry(chan_id);
		if let btree_map::Entry::Occupied(mut blockers) = blockers_entry {
			blockers.get_mut().retain(|blocker|
				if let &RAAMonitorUpdateBlockingAction::ClaimedMPPPayment { pending_claim } = &blocker {
					if *pending_claim == claim_ptr {
						let mut pending_claim_state_lock = pending_claim.0.lock().unwrap();
						let pending_claim_state = &mut *pending_claim_state_lock;
						pending_claim_state.channels_without_preimage.retain(|(cp, outp, cid, hid)| {
							if *cp == counterparty_node_id && *cid == chan_id && *hid == htlc_id {
								pending_claim_state.channels_with_preimage.push((*cp, *outp, *cid));
								false
							} else { true }
						});
						if pending_claim_state.channels_without_preimage.is_empty() {
							for (cp, outp, cid) in pending_claim_state.channels_with_preimage.iter() {
								freed_channels.push((*cp, *outp, *cid, blocker.clone()));
							}
						}
						!pending_claim_state.channels_without_preimage.is_empty()
					} else { true }
				} else { true }
			);
			if blockers.get().is_empty() {
				blockers.remove();
			}
		}
	});
}
To keep rustfmt from mangling this later, it might be worth turning this into some chained calls. Might flatten it a bit?
Hmm, I don't actually see much room to flatten it :/ Most of the ifs are pattern-matching, not just condition checks, and we have more logic than can be done with a simple filter.
let peer_state_mutex = per_peer_state.entry(counterparty_node_id).or_insert_with(||
	Mutex::new(PeerState {
		channel_by_id: new_hash_map(),
		inbound_channel_request_by_id: new_hash_map(),
		latest_features: InitFeatures::empty(),
		pending_msg_events: Vec::new(),
		in_flight_monitor_updates: BTreeMap::new(),
		monitor_update_blocked_actions: BTreeMap::new(),
		actions_blocking_raa_monitor_updates: BTreeMap::new(),
		is_connected: false,
	}));
let mut peer_state = peer_state_mutex.lock().unwrap();
For my understanding, in what case would we not have a PeerState for a counterparty?
If we're disconnected from the peer and we have no more channels, we may have pruned the PeerState entry.
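For context, a minimal sketch of that pruning rule with hypothetical, stripped-down types (not LDK's actual code): a PeerState entry survives only while the peer is connected or still has channels, so a later claim may need to recreate it via the or_insert_with above.

use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-in for LDK's PeerState, reduced to the two fields
// the pruning decision described above depends on.
struct PeerState { is_connected: bool, channel_by_id: HashMap<u64, ()> }

// Drop entries for peers that are disconnected and have no channels left.
fn prune_peers(per_peer_state: &mut HashMap<[u8; 33], Mutex<PeerState>>) {
	per_peer_state.retain(|_, peer_state_mutex| {
		let peer_state = peer_state_mutex.lock().unwrap();
		peer_state.is_connected || !peer_state.channel_by_id.is_empty()
	});
}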
Sorry for the delay
Force-pushed 0045ded to de40c83
Rebased and addressed comments.
 // Note that FreeOtherChannelImmediately should never be written - we were supposed to free
 // *immediately*. However, for simplicity we implement read/write here.
 (1, FreeOtherChannelImmediately) => {
 	(0, downstream_counterparty_node_id, required),
 	(2, downstream_funding_outpoint, required),
-	(4, blocking_action, required),
+	(4, blocking_action, upgradable_required),
Can we use unread_variants for this now?
Hmm, maybe? Note that this isn't an encoding change, anyway - it just can sometimes short-circuit and return None for the whole enum.
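A rough sketch of what the call site might look like with the macro opt-out. The variant names come from this PR, but the exact unread_variants placement and separators here are assumptions for illustration, not confirmed syntax:

impl_writeable_tlv_based_enum_upgradable!(RAAMonitorUpdateBlockingAction,
	(0, ForwardedPaymentInboundClaim) => {
		(0, channel_id, required),
		(2, htlc_id, required),
	},
	// Declared unread: written as the contentless odd type 255, with
	// deserialization short-circuiting to `None` for the whole enum.
	unread_variants: ClaimedMPPPayment
);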
lightning/src/ln/channelmanager.rs (outdated)
 let (source_claim_pairs, pending_claim_ptr_opt) = if sources.len() > 1 {
 	let mut pending_claims = PendingMPPClaim {
 		channels_without_preimage: Vec::new(),
 		channels_with_preimage: Vec::new(),
 	};
 	let (source_claim_pairs, channels_without_preimage) = sources.into_iter().filter_map(|htlc| {
 		if let Some(cp_id) = htlc.prev_hop.counterparty_node_id {
 			let htlc_id = htlc.prev_hop.htlc_id;
 			let chan_id = htlc.prev_hop.channel_id;
 			let chan_outpoint = htlc.prev_hop.outpoint;
 			Some((
 				(htlc, Some((cp_id, chan_id, htlc_id))),
 				(cp_id, chan_outpoint, chan_id, htlc_id),
 			))
 		} else {
 			None
 		}
 	}).unzip::<_, _, Vec<_>, _>();
 	pending_claims.channels_without_preimage = channels_without_preimage;
 	(source_claim_pairs, Some(Arc::new(Mutex::new(pending_claims))))
 } else {
 	(sources.into_iter().map(|htlc| (htlc, None)).collect(), None)
 };
 for (htlc, mpp_claim) in source_claim_pairs {
 	let mut pending_mpp_claim = None;
 	let pending_claim_ptr = pending_claim_ptr_opt.as_ref().map(|pending_claim| {
 		pending_mpp_claim = mpp_claim.map(|(cp_id, chan_id, htlc_id)|
 			(cp_id, chan_id, htlc_id, PendingMPPClaimPointer(Arc::clone(pending_claim)))
 		);
 		RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
 			pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),
 		}
 	});
 	self.claim_funds_from_hop(
 		htlc.prev_hop, payment_preimage,
 		|_, definitely_duplicate| {
 			debug_assert!(!definitely_duplicate, "We shouldn't claim duplicatively from a payment");
-			Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash })
+			(Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash, pending_mpp_claim }), pending_claim_ptr)
I find writing it this way a lot less confusing than using the intermediate source_claim_pairs, getting rid of the remaining muts, and doing some renames:
let pending_mpp_claim_opt = if sources.len() > 1 {
	let channels_without_preimage = sources.iter().filter_map(|htlc| {
		htlc.prev_hop.counterparty_node_id.map(|cp_id| {
			(cp_id, htlc.prev_hop.outpoint, htlc.prev_hop.channel_id, htlc.prev_hop.htlc_id)
		})
	}).collect();
	Some(Arc::new(Mutex::new(PendingMPPClaim {
		channels_without_preimage,
		channels_with_preimage: Vec::new()
	})))
} else { None };
for htlc in sources.into_iter() {
	let raa_blocker = pending_mpp_claim_opt.as_ref().map(|pending_claim| {
		RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
			pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),
		}
	});
	let pending_mpp_claim = pending_mpp_claim_opt.as_ref().map(|pending_claim| {
		(htlc.prev_hop.counterparty_node_id.expect("We checked this"), htlc.prev_hop.channel_id,
		 htlc.prev_hop.htlc_id, PendingMPPClaimPointer(Arc::clone(pending_claim)))
	});
	self.claim_funds_from_hop(
		htlc.prev_hop, payment_preimage,
		|_, definitely_duplicate| {
			debug_assert!(!definitely_duplicate, "We shouldn't claim duplicatively from a payment");
			(Some(MonitorUpdateCompletionAction::PaymentClaimed { payment_hash, pending_mpp_claim }), raa_blocker)
		}
	);
}
Thanks, indeed that is much more readable. I further tweaked the variable names slightly.
LGTM once the outstanding feedback is addressed and Jeff takes another look. Would be good with a squash.
Force-pushed de40c83 to bd1a822
Yeah, feel free to squash. Looks good aside from some minor comments.
blockers.get_mut().retain(|blocker|
	if let &RAAMonitorUpdateBlockingAction::ClaimedMPPPayment { pending_claim } = &blocker {
		if *pending_claim == claim_ptr {
			let mut pending_claim_state_lock = pending_claim.0.lock().unwrap();
Are there any lock order concerns? I guess the nesting defines what is allowable.
Hmm? I'm not sure I see any lockorder concerns here - we always take the new inner lock as the last lock, which is always safe. More generally, our lockorder violation detection is quite good, so as long as we hit a lockorder inversion once in tests, we should fail.
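To illustrate the ordering claim, a minimal self-contained sketch with hypothetical types (not LDK's): the outer map lock is taken first, the per-peer Mutex second, and the new per-claim Mutex strictly last, so the nesting itself rules out an inversion.

use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

struct ClaimState { parts_remaining: usize }
struct Peer { pending_claim: Arc<Mutex<ClaimState>> }

fn mark_part_done(peers: &RwLock<HashMap<u8, Mutex<Peer>>>, peer_id: u8) {
	let map = peers.read().unwrap(); // lock 1: the outer per-peer map
	if let Some(peer_mutex) = map.get(&peer_id) {
		let peer = peer_mutex.lock().unwrap(); // lock 2: this peer's state
		// lock 3: the claim state, always innermost and always taken last,
		// so no code path can acquire these three in a conflicting order.
		let mut claim = peer.pending_claim.lock().unwrap();
		claim.parts_remaining = claim.parts_remaining.saturating_sub(1);
	}
}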
@@ -6324,7 +6361,9 @@ where
 		}
 	}

-	fn claim_funds_from_hop<ComplFunc: FnOnce(Option<u64>, bool) -> Option<MonitorUpdateCompletionAction>>(
+	fn claim_funds_from_hop<
+		ComplFunc: FnOnce(Option<u64>, bool) -> (Option<MonitorUpdateCompletionAction>, Option<RAAMonitorUpdateBlockingAction>)
Looks like (None, Some) is not possible. Should we consider a different nesting or possibly a three-variant enum?
Hmm, good catch - not sure it matters, though? It complicates the code in the on-chain-claim branch a tiny bit, but more importantly they're logically distinct concepts, and there's nothing wrong with there being an RAA blocker without an event; we just don't have a use for it atm.
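For concreteness, the three-variant shape floated above might look something like this. It is purely hypothetical (the PR keeps the tuple), and it is written generically so the sketch stands alone; in context A would be MonitorUpdateCompletionAction and B would be RAAMonitorUpdateBlockingAction.

// Hypothetical alternative return type for the completion closure, making
// the unused (None, Some) combination unrepresentable.
enum ClaimCompletion<A, B> {
	Nothing,                 // no completion action, no blocker
	Action(A),               // an action with no blocker (e.g. the on-chain-claim branch)
	ActionWithBlocker(A, B), // an action plus an RAA blocker for an MPP claim
}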
In some cases, we have variants of an enum serialized using `impl_writeable_tlv_based_enum_upgradable` which we don't want to write/read. Here we add support for such variants by writing them using the (odd) type 255 without any contents and using `MaybeReadable` to decline to read them.
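A minimal sketch of that write/read scheme with stand-in types. The trait mirrors the shape of LDK's MaybeReadable, but the enum, type bytes beyond 255, and error handling here are simplified for illustration: the ignored variant is written as the contentless odd type 255, and reading it yields Ok(None) rather than an error.

use std::io;

enum Action { Useful(u64), IgnoredOnRead }

// Simplified stand-in for LDK's MaybeReadable: Ok(None) means "understood,
// but declined to read", so the value simply disappears on deserialization.
trait MaybeReadableSketch: Sized {
	fn read<R: io::Read>(r: &mut R) -> Result<Option<Self>, io::Error>;
}

fn write_action<W: io::Write>(a: &Action, w: &mut W) -> io::Result<()> {
	match a {
		Action::Useful(v) => { w.write_all(&[0])?; w.write_all(&v.to_be_bytes()) },
		// Never meant to be read back: the odd type 255 with no contents.
		// Odd types are optional-to-understand, so readers can skip them.
		Action::IgnoredOnRead => w.write_all(&[255]),
	}
}

impl MaybeReadableSketch for Action {
	fn read<R: io::Read>(r: &mut R) -> Result<Option<Self>, io::Error> {
		let mut ty = [0u8; 1];
		r.read_exact(&mut ty)?;
		match ty[0] {
			0 => {
				let mut v = [0u8; 8];
				r.read_exact(&mut v)?;
				Ok(Some(Action::Useful(u64::from_be_bytes(v))))
			},
			255 => Ok(None), // decline to read the ignored variant
			t if t % 2 == 1 => Ok(None), // unknown odd variants are skippable
			_ => Err(io::Error::new(io::ErrorKind::InvalidData, "unknown even variant")),
		}
	}
}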
Because we track pending `ChannelMonitorUpdate`s per-peer, we really need to know which peer an HTLC came from when we go to claim it on-chain, allowing us to look up any blocked actions in the `PeerState`. While we could do this by moving the blocked actions to some global "closed-channel update queue", it's simpler to do it this way. While this trades off `ChannelMonitorUpdate` size somewhat (the `HTLCSource` is included in many of them), which we should be sensitive to, it will also allow us to (eventually) remove the SCID -> peer + channel_id lookups we do when claiming or failing today, which are somewhat annoying to deal with.
If we claim an MPP payment and only persist some of the `ChannelMonitorUpdate`s which include the preimage prior to shutting down, we may be in a state where some of our `ChannelMonitor`s have the preimage for a payment while others do not.

This, it turns out, is actually mostly safe - on startup `ChannelManager` will detect there's a payment it has as unclaimed but there's a corresponding payment preimage in a `ChannelMonitor` and go claim the other MPP parts. This works so long as the `ChannelManager` has been persisted after the payment has been received but prior to the `PaymentClaimable` event being processed (and the claim itself occurring). This is not always true, and certainly not required by our API, but our `lightning-background-processor` today does persist prior to event handling, so it is generally true subject to some race conditions.

In order to address this race we need to copy payment preimages across channels irrespective of the `ChannelManager`'s payment state, but this introduces another wrinkle - if one channel makes substantial progress while other channel(s) are still waiting to get the payment preimage into their `ChannelMonitor`(s) while the `ChannelManager` hasn't been persisted after the payment was received, we may end up without the preimage on disk.

Here, we address this issue with a new `RAAMonitorUpdateBlockingAction` variant for this specific case. We block persistence of an RAA `ChannelMonitorUpdate` which may remove the preimage from disk until all channels have had the preimage added to their `ChannelMonitor`. We do this only in-memory (and not on disk) as we can recreate this blocker during the startup re-claim logic.

This will enable us to claim MPP parts without using the `ChannelManager`'s payment state in later work.
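The bookkeeping this new variant drives can be sketched as follows, with hypothetical simplified types (the real code tracks (counterparty, outpoint, channel_id, htlc_id) tuples inside the retain loop shown earlier in this thread):

// Stripped-down model of PendingMPPClaim's two lists: a part moves from
// `channels_without_preimage` to `channels_with_preimage` once the
// ChannelMonitorUpdate carrying the preimage completes, and the RAA
// blockers on all involved channels are released only when the first
// list drains to empty.
struct PendingMppClaimSketch {
	channels_without_preimage: Vec<u64>, // channel ids still awaiting the preimage
	channels_with_preimage: Vec<u64>,    // channel ids whose monitors have it
}

impl PendingMppClaimSketch {
	/// Records that `chan_id`'s monitor now holds the preimage. Returns true
	/// once every part has it, i.e. when the RAA blockers can be freed.
	fn preimage_persisted(&mut self, chan_id: u64) -> bool {
		if let Some(pos) = self.channels_without_preimage.iter().position(|c| *c == chan_id) {
			let c = self.channels_without_preimage.remove(pos);
			self.channels_with_preimage.push(c);
		}
		self.channels_without_preimage.is_empty()
	}
}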
Force-pushed bd1a822 to 2d1306a
Squashed with a trivial cleanup:

$ git diff-tree -U3 bd1a8222 2d1306a3
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index f9bf2b963..67159ef52 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -6321,14 +6321,14 @@ where
 			None
 		};
 		for htlc in sources {
-			let this_mpp_claim = pending_mpp_claim_ptr_opt.as_ref().map(|pending_mpp_claim|
+			let this_mpp_claim = pending_mpp_claim_ptr_opt.as_ref().and_then(|pending_mpp_claim|
 				if let Some(cp_id) = htlc.prev_hop.counterparty_node_id {
 					let claim_ptr = PendingMPPClaimPointer(Arc::clone(pending_mpp_claim));
 					Some((cp_id, htlc.prev_hop.channel_id, htlc.prev_hop.htlc_id, claim_ptr))
 				} else {
 					None
 				}
-			).flatten();
+			);
 			let raa_blocker = pending_mpp_claim_ptr_opt.as_ref().map(|pending_claim| {
 				RAAMonitorUpdateBlockingAction::ClaimedMPPPayment {
 					pending_claim: PendingMPPClaimPointer(Arc::clone(pending_claim)),