Fail back HTLCs that fail to be freed from the holding cell #640
Conversation
Codecov Report
@@ Coverage Diff @@
## master #640 +/- ##
==========================================
+ Coverage 91.30% 91.38% +0.07%
==========================================
Files 35 35
Lines 21400 21663 +263
==========================================
+ Hits 19539 19796 +257
- Misses 1861 1867 +6
Continue to review full report at Codecov.
Test looks great! Just one comment, though I'm not sure if it can be addressed.
Force-pushed from 22950b8 to 38b44f5
Didn't do a super detailed review, but the new logic is definitely the right direction.
// needs to be surfaced to the user.
fn fail_htlcs(&self, mut htlcs_to_fail: Vec<(HTLCSource, PaymentHash)>, channel_id: [u8; 32]) -> Result<(), MsgHandleErrInternal> {
	for (htlc_src, payment_hash) in htlcs_to_fail.drain(..) {
		match htlc_src {
I feel like this could be DRY'd, but didn't dig into it.
I DRY'd it a liiittle bit but LMK if you have a better way in mind.
Force-pushed from 4f38bad to 6cf1a5d
@@ -2115,7 +2115,7 @@ impl<ChanSigner: ChannelKeys> Channel<ChanSigner> {

 	/// Used to fulfill holding_cell_htlcs when we get a remote ack (or implicitly get it by them
 	/// fulfilling or failing the last pending HTLC)
-	fn free_holding_cell_htlcs<L: Deref>(&mut self, logger: &L) -> Result<Option<(msgs::CommitmentUpdate, ChannelMonitorUpdate)>, ChannelError> where L::Target: Logger {
+	fn free_holding_cell_htlcs<L: Deref>(&mut self, logger: &L) -> Result<(Option<(msgs::CommitmentUpdate, ChannelMonitorUpdate)>, Vec<(HTLCSource, PaymentHash)>), ChannelError> where L::Target: Logger {
I'm getting concerned that the return values are growing more complicated. These bleed out into the public interface (e.g., revoke_and_ack, channel_reestablish), which have even more complicated return values -- lots of Nones and empty Vecs returned.
Not sure how much this should be considered now (they were already quite complicated), but it may be an indication that the interaction between Channel and ChannelManager may need to be rethought in the future. Would love to hear any insights you may have had working with these modules.
I'd agree that there are a lot of return values, though it reassures me that Channel's "public" API isn't intended to be seen by RL users.
I looked into refactoring it to have fewer return values but nothing jumped out as low-hanging fruit. I'd have to take a deeper look. Open an issue, maybe?
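For context, one shape such a refactor could take -- purely a hypothetical sketch, not code from this PR -- is bundling the tuple returned by Channel::revoke_and_ack into a named struct so callers don't have to thread Nones and empty Vecs around:

// Hypothetical sketch: the struct name and fields are invented here for illustration and
// loosely mirror the values revoke_and_ack returns in this PR.
pub(super) struct RAAUpdates {
	pub commitment_update: Option<msgs::CommitmentUpdate>,
	pub monitor_update: ChannelMonitorUpdate,
	// Holding-cell HTLCs that could not be freed and must be failed backwards.
	pub holding_cell_failed_htlcs: Vec<(HTLCSource, PaymentHash)>,
}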
lightning/src/ln/functional_tests.rs
Outdated
}

#[test]
fn test_holding_cell_htlc_with_pending_fee_update_multihop() {
It would be valuable for posterity if these two tests had high-level documentation. The difference between the two names is the single word "multihop" which -- while accurately differentiating the scenarios -- abstracts away much of the additional behavior that is tested.
Would you mind adding a few lines to each stating the scenario and behaviors tested? Doing so may also help in finding a more succinct test name.
Updated -- let me know if you think the comments still need work.
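For reference, the kind of high-level comment being requested might look something like this (the wording below is invented for illustration and is not the text that ended up in the PR):

#[test]
fn test_holding_cell_htlc_with_pending_fee_update_multihop() {
	// Hypothetical example comment: an intermediate hop puts a forwarded HTLC into its
	// holding cell because a fee update is pending on the outbound channel; if the HTLC
	// can never be freed from the holding cell, it should be failed back to the origin
	// node rather than sitting there until the counterparty eventually closes on us.
}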
lightning/src/ln/channel.rs
Outdated
// into the holding cell without ever being
// successfully forwarded/failed/fulfilled, causing
// our counterparty to eventually close on us.
htlcs_to_fail.push((source.clone(), *payment_hash));
After the outer match we do this:
if err.is_some() {
	self.holding_cell_htlc_updates.push(htlc_update);
	if let Some(ChannelError::Ignore(_)) = err {
		// If we failed to add the HTLC, but got an Ignore error, we should
		// still send the new commitment_signed, so reset the err to None.
		err = None;
	}
}
Which, IIUC, re-pushes the send onto the holding cell, even though we're about to fail it. ISTM this means we'll try to forward that HTLC again later?
Yikes, should've caught that. Do you think it's worth (maybe in a follow-up PR) adding a check in this area to make sure there are no holding cell HTLCs when a test finishes?
Hmm, yea, that would be cool! It would probably hit a few tests that fail just because it's a new requirement, but it's probably worth doing.
Force-pushed from 2c333cf to 5db8841
lightning/src/ln/channel.rs
Outdated
if let Some(ChannelError::Ignore(_)) = err {
	// If we failed to add the HTLC, but got an Ignore error, we should
	// still send the new commitment_signed, so reset the err to None.
	err = None;
} else {
	self.holding_cell_htlc_updates.push(htlc_update);
Hmm, I think it's fine to move this here, but it would be nice to add a comment explaining why it's fine to drop ChannelError::Ignore responses from get_update_fail_htlc and get_update_fulfill_htlc.
Hope I interpreted this correctly
Once you add that comment, feel free to squash -- I think this is fine.
Force-pushed from 3eb9e41 to ffcf9c9
Hmm, ok, looking at this again -- sorry for the delay. Looks like the tests all pass dropping the err tracking stuff entirely (see patch below). I think this would totally be the right direction - the previous comment calls out a possible case of hitting Err(Ignore) when we end up back in
The one exception which I think may need longer-term work is the
Force-pushed from d436792 to ed48943
Force-pushed from ed48943 to 492041b
Force-pushed from 0df7bb5 to f9290a7
lightning/src/ln/channelmanager.rs
Outdated
if commitment_update.is_none() {
	order = RAACommitmentOrder::RevokeAndACKFirst;
}
return_monitor_err!(self, e, channel_state, chan, order, revoke_and_ack.is_some(), commitment_update.is_some());
I think it's a bug if we get htlcs_to_fail and then return early here without actually failing the HTLCs (of course I don't know that we can get HTLCs to fail here to begin with, but it's not ideal).
That makes sense. However, my understanding from our slack conversation on Friday was that we can just assert that there are no htlcs to fail backwards (given that we drop all holding cell HTLC forwards on peer disconnect). Let me know if that isn't right.
Yep! Can you add a more conservative assert?
diff --git a/lightning/src/ln/channel.rs b/lightning/src/ln/channel.rs
index cf2ccb9b..9e7b653e 100644
--- a/lightning/src/ln/channel.rs
+++ b/lightning/src/ln/channel.rs
@@ -2827,6 +2827,11 @@ impl<ChanSigner: ChannelKeys> Channel<ChanSigner> {
// update_adds should have been dropped on peer disconnect. If this changes in
// the future, corresponding changes will need to be made in the ChannelManager's
// reestablish logic, because the logic assumes there are no HTLCs to fail backwards.
+ for htlc_update in self.holding_cell_htlc_updates.iter() {
+ if let &HTLCUpdateAwaitingACK::AddHTLC { .. } = htlc_update {
+ debug_assert!(false, "We don't handle some pending add-HTLC edge-cases properly on reconnect, and they shouldn't be there to begin with");
+ }
+ }
match self.free_holding_cell_htlcs(logger) {
Err(ChannelError::Close(msg)) => return Err(ChannelError::Close(msg)),
Err(ChannelError::Ignore(_)) | Err(ChannelError::CloseDelayBroadcast(_)) => panic!("Got non-channel-failing result from free_holding_cell_htlcs"),
Force-pushed from 7b0841f to a79e075
lightning/src/ln/channelmanager.rs
Outdated
@@ -2679,7 +2718,7 @@ impl<ChanSigner: ChannelKeys, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref>
 				return Err(MsgHandleErrInternal::send_err_msg_no_close("Got a message for a channel from the wrong node!".to_owned(), msg.channel_id));
 			}
 			let was_frozen_for_monitor = chan.get().is_awaiting_monitor_update();
-			let (commitment_update, pending_forwards, pending_failures, closing_signed, monitor_update) =
+			let (commitment_update, pending_forwards, pending_failures, closing_signed, monitor_update, htlcs_to_fail) =
Hmm, isn't it an issue that we may return early two lines down and thus not fail back the HTLCs which were returned?
Hm yeah, I think so. Is that an existing problem for the pending failures here too?
Edit: never mind, I see that those failures are saved back in the Channel to be processed later.
Sadly, due to codecov breakage I think this needs a rebase on master.
Ok, looks good sans the comments left.
Force-pushed from a79e075 to b8750c7
lightning/src/ln/channelmanager.rs
Outdated
}
};

// Failing back holding cell HTLCs requires acquiring the channel state
I don't think we can drop the channel_state lock between getting the return value from revoke_and_ack and handling it - otherwise we introduce race issues where we may, e.g., drop the lock, have a different thread go forward HTLCs into the channel, and then take the lock and try to submit monitor updates. That would result in messages getting sent to the peer out of order or monitor updates happening out of order.
That makes sense. To be clear, I'm interpreting this as: these lines should also be moved to within the first lock (in addition to moving the holding-cell-htlc-failings to within the first lock).
I think basically everything which was under the lock previously needs to remain under the lock, we just need to also be able to process the fail-backs even if we want to return early.
Force-pushed from 4f299e4 to 47230b3
Hmm, instead of holding the lock through the fail-backs, maybe it's easier to just wrap internal_revoke_and_ack into an internal_revoke_and_ack_pre_fails and return the list of fails to an outer function. That would also make it easier if we move forward with per-channel locks.
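Roughly, that suggestion could look like the sketch below. The signatures and the exact split are assumptions for illustration; only internal_revoke_and_ack and fail_htlcs are names taken from this PR.

// Sketch of the wrapper idea: the inner function does all the work under the
// channel_state lock and returns the holding-cell HTLCs that need failing back;
// the outer function fails them once the lock has been released.
fn internal_revoke_and_ack(&self, their_node_id: &PublicKey, msg: &msgs::RevokeAndACK) -> Result<(), MsgHandleErrInternal> {
	let (htlcs_to_fail, res) = self.internal_revoke_and_ack_pre_fails(their_node_id, msg);
	// No channel_state lock is held here, so failing backwards (which may touch other
	// channels) can't reorder this channel's messages or monitor updates, and it runs
	// even when the inner call wants to surface an error.
	self.fail_htlcs(htlcs_to_fail, msg.channel_id)?;
	res
}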
Force-pushed from 47230b3 to 14b02f2
I'm unsure if there's a way around making
Hmm, right, that is awkward. One alternative to try is to use the break_chan_entry!() macro instead and put the whole thing inside of something you can break out of (i.e. a dummy loop {}). That gives you manual control of the control-flow without having to do a full return - then you can just store the htlcs_to_fail on a separate stack entry and set it when it gets returned to you.
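As a rough illustration of that pattern (a sketch only: the handler body is elided, and names other than fail_htlcs and channel_state are assumptions rather than code from the PR):

// Sketch of the dummy-loop pattern: acquire the lock inside the loop so that `break`
// drops it, then run the fail-backs after the loop regardless of how we broke out.
let mut htlcs_to_fail = Vec::new();
let res = loop {
	let mut channel_state_lock = self.channel_state.lock().unwrap();
	// ... the existing revoke_and_ack handling goes here, unchanged except that every
	// early `return Err(e)` becomes `break Err(e)`, and the success path does:
	//     htlcs_to_fail = <holding-cell failures returned by chan.revoke_and_ack(...)>;
	break Ok(());
};
// channel_state_lock was scoped to the loop body, so the lock is released before we
// fail the HTLCs backwards -- and we reach this point on both the Ok and the Err paths.
self.fail_htlcs(htlcs_to_fail, msg.channel_id)?;
res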
Good to know about I'm sorta back to my original solution? But I might just make a bespoke
This compiles for me (though I didn't test it):
Plus add a test.
Oh, that works great 🤦♀️ I was putting the loop inside the lock.
Force-pushed from 14b02f2 to 523cab8
Also update comment explaining when this case would be hit.
This case was discussed on Slack a long time ago due to me being confused about the original comment, and adding a test was suggested.