Handle double-HTLC-claims without failing the backwards channel #977
Conversation
Force-pushed from 4532e33 to db5ddcb.
Codecov Report

@@            Coverage Diff             @@
##             main     #977      +/-   ##
==========================================
+ Coverage   90.73%   92.37%    +1.64%
==========================================
  Files          60       60
  Lines       30702    37203     +6501
==========================================
+ Hits        27857    34368     +6511
+ Misses       2845     2835       -10

Continue to review the full report at Codecov.
Force-pushed from db5ddcb to 541411d.
Force-pushed from 541411d to 14fd4a2.
    return Ok((None, Some(monitor_update)));
}
#[cfg(any(test, feature = "fuzztarget"))]
self.historical_inbound_htlc_fulfills.insert(htlc_id_arg);
Also add a `contains()` check inside the `HTLCUpdateAwaitingACK::ClaimHTLC` branch in `get_update_fail_htlc`, to pass the debug_assert in case of a downgrade from `update_fulfill_htlc` to `update_fail_htlc`, which is also correct behavior?
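A rough sketch of the kind of guard being suggested, assuming the PR's test-only `historical_inbound_htlc_fulfills` set and a simplified `Result<Option<msgs::UpdateFailHTLC>, ChannelError>`-style return type (illustrative, not the exact patch):

```rust
// Inside get_update_fail_htlc's scan of the holding cell (sketch only).
for pending_update in self.holding_cell_htlc_updates.iter() {
    match pending_update {
        &HTLCUpdateAwaitingACK::ClaimHTLC { htlc_id, .. } => {
            if htlc_id == htlc_id_arg {
                // A fulfill is already queued for this HTLC. Rather than
                // debug_assert!(false) unconditionally, consult the test-only
                // tracking set so a fulfill-then-fail sequence doesn't panic.
                #[cfg(any(test, feature = "fuzztarget"))]
                debug_assert!(self.historical_inbound_htlc_fulfills.contains(&htlc_id_arg));
                return Ok(None);
            }
        },
        _ => {},
    }
}
```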
We always `debug_assert!(false, ...)` there, because it's a bug to duplicate-fail (as failing waits for the commitment_signed dance to complete).
I was thinking about the sequence success-then-failure? We don't wait for the full commitment_signed dance on the forward channel before passing a success backward, so the `get_update_fail_htlc` in `process_pending_htlc_forwards` can be reached a second time?
Oh! Yep, you're totally right; worse, the fuzzer won't hit that case. I updated the new test in this PR to hit that case and fixed the bugs.
// can cause duplicative claims if a node sends an update_fulfill_htlc message, disconnects,
// reconnects, and then has to re-send its update_fulfill_htlc message again.
// In previous code, we didn't handle the double-claim correctly, spuriously closing the
// channel on which the inbound HTLC was received.
Are you saying that, prior to this PR, if we received a duplicate `update_fulfill_htlc` on the forward channel, we would have closed the backward one? Are you referring to the current `get_update_fulfill_htlc` or another behavior? I don't get it here...
> Are you saying that, prior to this PR, if we received a duplicate `update_fulfill_htlc` on the forward channel, we would have closed the backward one?

Yes, that's correct. We'd `debug_assert!(false, ...)`, then close the channel when debug assertions were off.
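For concreteness, the pre-PR failure path looked roughly like this (reconstructed for illustration only; `already_resolved` is a hypothetical stand-in for the state check, not the verbatim old code):

```rust
// Old behavior (sketch): a duplicate fulfill tripped an assertion in debug
// builds and, with assertions compiled out, returned a Close error that tore
// down the channel the inbound HTLC arrived on.
if already_resolved {
    debug_assert!(false, "Tried to fulfill an HTLC that was already resolved");
    return Err(ChannelError::Close("Unable to find a pending HTLC which matched the given HTLC ID"));
}
```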
fn test_reconnect_dup_htlc_claims() {
    do_test_reconnect_dup_htlc_claims(HTLCStatusAtDupClaim::Received);
    do_test_reconnect_dup_htlc_claims(HTLCStatusAtDupClaim::HoldingCell);
    do_test_reconnect_dup_htlc_claims(HTLCStatusAtDupClaim::Cleared);
Just for clarity, the third case you're mentioning in the commit message isn't exercised here?

> With debug_assertions enabled, we also check that the previous
> close of the same HTLC was a fulfill, and that we are not moving
> from an HTLC failure to an HTLC claim after it's too late.

I translate that as hitting the `HTLCUpdateAwaitingACK::FailHTLC` branch in `get_update_fulfill_htlc`?
> Just for clarity, the third case you're mentioning in the commit message isn't exercised here?

Yes, I do not believe that is possible to hit, so I certainly hope we aren't exercising it :).

> I translate that as hitting the `HTLCUpdateAwaitingACK::FailHTLC` branch in `get_update_fulfill_htlc`?

Yes, roughly.
Agreed, we shouldn't be able to hit it if we duly wait for the commitment_signed dance on the forward channel before passing the `update_fail_htlc` backward.
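The three reconnect-time states the test sweeps are visible in the excerpt above; as a reading aid, here is a sketch of what each variant plausibly covers (the comments are my interpretation of the test, not text from the patch):

```rust
// State of the claim at the moment the duplicate update_fulfill_htlc arrives.
enum HTLCStatusAtDupClaim {
    Received,    // fulfill received, claim not yet forwarded backwards
    HoldingCell, // backwards claim queued in the holding cell awaiting RAA
    Cleared,     // backwards claim already committed via commitment_signed
}
```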
Code Review ACK b64ab91 modulo #977 (comment)
Force-pushed from b64ab91 to 40b7615.
Pushed additional commits to simplify the call graph further and handle a TODO that's been in the code for some time.
Force-pushed from f4c2f80 to 9861d16.
lightning/src/ln/channel.rs (Outdated)
(Some(_), None) => {
    // If we generated a claim message, we absolutely should have generated a
    // ChannelMonitorUpdate, otherwise we are going to probably lose funds.
    unreachable!();
Would it be better for the compiler to enforce this by making the return value an enum? I think that would also allow us to get rid of the `unwrap` on line 2470 in the `free_holding_cell` method.
I don't think it drops the `unwrap`, sadly, but yeah, you're right; I simplified it a bit more.
I think the new return value of `Option<(Option<msgs::UpdateFulfillHTLC>, ChannelMonitorUpdate)>` isn't too easy on the eyes. Think we could add an abstraction, like an enum or struct, to encapsulate it?
Done, I made it into a little enum.
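For reference, the resulting shape is roughly the following (a sketch of the little enum as of this PR; the variant and field names are my best reading of the diff, with the types coming from the lightning crate):

```rust
// Replaces the nested-Option return of get_update_fulfill_htlc (sketch).
enum UpdateFulfillFetch {
    /// First-time claim: the monitor update carrying the preimage must always
    /// be applied; msg is None when the claim had to go into the holding cell.
    NewClaim {
        monitor_update: ChannelMonitorUpdate,
        msg: Option<msgs::UpdateFulfillHTLC>,
    },
    /// The HTLC was already claimed (e.g. a re-sent update_fulfill_htlc after
    /// a reconnect): nothing to do, and no reason to fail the channel.
    DuplicateClaim {},
}
```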
Force-pushed from 9861d16 to dac9f3f.
Code Review ACK dac9f3f
lightning/src/ln/channelmanager.rs (Outdated)
    return Err(None)
Err((e, monitor_update)) => {
    if let Err(e) = self.chain_monitor.update_channel(chan.get().get_funding_txo().unwrap(), monitor_update) {
        log_error!(self.logger, "Critical error: failed to update channel monitor with preimage {:?}: {:?}",
Yeah, I agree we're pretty stuck in this catastrophic scenario... if the error source was tied to new-state key derivation, we can still hope for the counterparty to broadcast its own commitment and claim the offered output on it, though that also assumes we still have the key for this `htlcpubkey`... a lot of ifs! One upgrade to your `KeysInterface` could also be to inform the user when aborting a signature has detrimental consequences for funds safety, though I expect the claim-back to be automated for routing nodes.
This PR is shaping up nicely! Just have a few more nits. I think the return values on `get_update_fulfill` and others are a bit complicated to parse, though that seems to be mostly an existing issue. Not sure if we want to spend the time in this release to make them nicer.
lightning/src/ln/channelmanager.rs (Outdated)
    return Err(None)
Err((e, monitor_update)) => {
    if let Err(e) = self.chain_monitor.update_channel(chan.get().get_funding_txo().unwrap(), monitor_update) {
        log_error!(self.logger, "Critical error: failed to update channel monitor with preimage {:?}: {:?}",
Isn't the above `update_channel` fail case also a critical error? Just trying to figure out why the logic on `update_channel` failure isn't symmetric...
You're right, I'm still a bit stuck in the old mental model where we'd retry `ChannelMonitorUpdate`s for users, but we no longer do that, so it's not a "critical" error as long as users can get the update to the monitors. I updated the text to be clearer and logged in both places (with appropriate log levels).
What do you mean here by getting the update to the monitors? If I'm following correctly, we might have an HTLC claimed forward with the preimage passed as an arg to `claim_funds_from_hop`, but we won't have been able to generate an updated commitment, nor to pass the preimage to our onchain backend. If the counterparty force-closes the channel, our `ChannelMonitor` won't claim the due offered HTLC?
Correct, kinda. The docs say you must still deliver the `ChannelMonitorUpdate` to some (probably in-memory) `ChannelMonitor`. The `PermanentErr` case is only for failure to update some `ChannelMonitor`s, never all of them. There isn't anything better we can do in any case; we've given the update to the user, and they have to deliver it somewhere, because we can't do anything further.
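To make the contract concrete, a hedged sketch of the caller-side handling being described (the call mirrors the quoted excerpt, with `funding_txo` standing in for `chan.get().get_funding_txo().unwrap()`; the error-policy comments are my reading, not the exact final text):

```rust
// Sketch: once get_update_fulfill_htlc has produced a monitor update carrying
// the preimage, we hand it to the user's chain::Watch. Even if update_channel
// reports a failure, the update has been surfaced to the user; they must still
// apply it to at least one in-memory ChannelMonitor, since that is what lets
// the monitor claim the HTLC if the channel ends up on chain.
if let Err(e) = self.chain_monitor.update_channel(funding_txo, monitor_update) {
    // No internal retries anymore: log and let the user's
    // ChannelMonitorUpdateErr handling decide the channel's fate.
    log_error!(self.logger, "Failed to update ChannelMonitor with preimage: {:?}", e);
}
```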
Force-pushed from 829c3d1 to b35d0e2.
Force-pushed from b35d0e2 to 65fcf57.
Code Review ACK 65fcf57, though let me know about the severity of this: #977 (comment); not sure we're all aligned.
When receiving an update_fulfill_htlc message, we immediately forward the claim backwards along the payment path before waiting for a full commitment_signed dance. This is great, but can cause duplicative claims if a node sends an update_fulfill_htlc message, disconnects, reconnects, and then has to re-send its update_fulfill_htlc message again.

While there was code to handle this, it treated it as a channel error on the inbound channel, which is incorrect: this is an expected, albeit incredibly rare, condition. Instead, we handle these double-claims correctly, simply ignoring them.

With debug_assertions enabled, we also check that the previous close of the same HTLC was a fulfill, and that we are not moving from an HTLC failure to an HTLC claim after it's too late.

A test is also added, which hits all three failure cases in `Channel::get_update_fulfill_htlc`. Found by the chanmon_consistency fuzzer.
Previously, we could fail to generate a new commitment transaction, but that simply indicated we had gone to double-claim an HTLC. Now that double-claims are returned instead as `Ok(None)`, we should handle the error case and fail the channel, as the only way to hit the error case is if key derivation failed or the user refused to sign the new commitment transaction. This also resolves an issue where we wouldn't inform our `ChannelMonitor` of the new payment preimage if we failed to fetch a signature for the new commitment transaction.
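A condensed sketch of the error path that commit describes (the overall structure is assumed; `send_commitment_no_status_check` is a `channel.rs` method of this era, but treat the exact shape as illustrative):

```rust
// Sketch: if signing the new commitment fails after a fresh claim, fail the
// channel, but return the preimage-bearing monitor update alongside the error
// so the caller can still hand it to the ChannelMonitor.
let (commitment_msg, mut additional_update) = match self.send_commitment_no_status_check() {
    Ok(update) => update,
    Err(e) => return Err((e, monitor_update)), // don't lose the preimage!
};
// On success, the commitment update is applied together with the
// preimage-bearing update as one ChannelMonitorUpdate.
monitor_update.updates.append(&mut additional_update.updates);
```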
Force-pushed from 65fcf57 to f06f9d1.
Squashed fixups without diff. Diff from Val's ack is below, will merge after CI.