Handle double-HTLC-claims without failing the backwards channel #977


Merged

Conversation

TheBlueMatt
Collaborator

When receiving an update_fulfill_htlc message, we immediately
forward the claim backwards along the payment path before waiting
for a full commitment_signed dance. This is great, but can cause
duplicative claims if a node sends an update_fulfill_htlc message,
disconnects, reconnects, and then has to re-send its
update_fulfill_htlc message again.

While there was code to handle this, it treated it as a channel
error on the inbound channel, which is incorrect - this is an
expected, albeit incredibly rare, condition. Instead, we handle
these double-claims correctly, simply ignoring them.

With debug_assertions enabled, we also check that the previous
close of the same HTLC was a fulfill, and that we are not moving
from an HTLC failure to an HTLC claim after it's too late.

A test is also added, which hits all three failure cases in
Channel::get_update_fulfill_htlc.

Found by the chanmon_consistency fuzzer.
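
To make the behavior described above concrete, here is a minimal, self-contained sketch of the idea — not the actual LDK code. The names SimpleChannel, FulfillOutcome, and HtlcId are hypothetical simplifications; only historical_inbound_htlc_fulfills mirrors a real field that the PR tracks under test/fuzz builds. Fulfilling an HTLC that is no longer pending is treated as a duplicate claim and ignored, rather than failing the channel.

use std::collections::HashSet;

type HtlcId = u64;

#[derive(Debug, PartialEq)]
enum FulfillOutcome {
    /// A new fulfill was generated for a still-pending inbound HTLC.
    NewClaim(HtlcId),
    /// The HTLC is already gone: this is a duplicate claim and is simply ignored.
    DuplicateClaim,
}

struct SimpleChannel {
    pending_inbound_htlcs: HashSet<HtlcId>,
    // Only tracked in test/fuzz builds in the real code; always tracked here.
    historical_inbound_htlc_fulfills: HashSet<HtlcId>,
}

impl SimpleChannel {
    fn get_update_fulfill_htlc(&mut self, htlc_id: HtlcId) -> FulfillOutcome {
        if self.pending_inbound_htlcs.remove(&htlc_id) {
            self.historical_inbound_htlc_fulfills.insert(htlc_id);
            FulfillOutcome::NewClaim(htlc_id)
        } else {
            // Previously treated as a channel error; now an expected (if rare) condition.
            // With debug assertions, check the HTLC was indeed fulfilled the first time.
            debug_assert!(self.historical_inbound_htlc_fulfills.contains(&htlc_id));
            FulfillOutcome::DuplicateClaim
        }
    }
}

fn main() {
    let mut chan = SimpleChannel {
        pending_inbound_htlcs: [42].into_iter().collect(),
        historical_inbound_htlc_fulfills: HashSet::new(),
    };
    assert_eq!(chan.get_update_fulfill_htlc(42), FulfillOutcome::NewClaim(42));
    // Re-delivered update_fulfill_htlc after a reconnect: ignored, channel stays open.
    assert_eq!(chan.get_update_fulfill_htlc(42), FulfillOutcome::DuplicateClaim);
}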

@TheBlueMatt TheBlueMatt added this to the 0.1 milestone Jun 30, 2021
@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from 4532e33 to db5ddcb Compare June 30, 2021 03:12
@codecov

codecov bot commented Jun 30, 2021

Codecov Report

Merging #977 (65fcf57) into main (431f807) will increase coverage by 1.64%.
The diff coverage is 94.69%.

❗ Current head 65fcf57 differs from pull request most recent head f06f9d1. Consider uploading reports for the commit f06f9d1 to get more accurate results

@@            Coverage Diff             @@
##             main     #977      +/-   ##
==========================================
+ Coverage   90.73%   92.37%   +1.64%     
==========================================
  Files          60       60              
  Lines       30702    37203    +6501     
==========================================
+ Hits        27857    34368    +6511     
+ Misses       2845     2835      -10     
Impacted Files Coverage Δ
lightning/src/ln/channelmanager.rs 88.56% <77.14%> (+3.83%) ⬆️
lightning/src/ln/channel.rs 92.02% <92.85%> (+3.51%) ⬆️
lightning/src/chain/channelmonitor.rs 90.80% <100.00%> (ø)
lightning/src/ln/chanmon_update_fail_tests.rs 98.33% <100.00%> (+0.39%) ⬆️
lightning/src/ln/functional_test_utils.rs 95.89% <100.00%> (+1.05%) ⬆️
lightning/src/ln/functional_tests.rs 98.75% <100.00%> (+1.49%) ⬆️
lightning/src/ln/onion_route_tests.rs 96.53% <100.00%> (-0.28%) ⬇️
lightning/src/util/message_signing.rs 93.02% <0.00%> (-0.09%) ⬇️
lightning/src/ln/features.rs 99.78% <0.00%> (+0.07%) ⬆️
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from db5ddcb to 541411d Compare June 30, 2021 03:33
@TheBlueMatt TheBlueMatt modified the milestones: 0.1, 0.0.100 Jul 2, 2021
@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from 541411d to 14fd4a2 Compare July 7, 2021 17:49
return Ok((None, Some(monitor_update)));
}
#[cfg(any(test, feature = "fuzztarget"))]
self.historical_inbound_htlc_fulfills.insert(htlc_id_arg);

Also add a contains() check inside the HTLCUpdateAwaitingACK::ClaimHTLC branch in get_update_fail_htlc to pass the debug_assert in case of a downgrade from update_fulfill_htlc to update_fail_htlc, which is also correct behavior?

Collaborator Author

We always debug_assert!(false, ... there, because it's a bug to duplicate-fail (as failing waits for the commitment_signed dance to complete).


I was thinking about the success-then-failure sequence. We don't wait for the full commitment_signed dance on the forward channel before passing a success backward, so the get_update_fail_htlc in process_pending_htlc_forwards can be reached a second time?

Collaborator Author

Oh! Yep, you're totally right, and worse, the fuzzer won't hit that case. I updated the new test in this PR to hit that case and fixed the bugs.
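
As a rough illustration of the fix discussed in this thread (hypothetical names again, extending the SimpleChannel sketch above rather than showing the real get_update_fail_htlc): a fail for an HTLC that is already gone only trips the debug assertion if that HTLC was never fulfilled, so the success-then-failure sequence is tolerated while a true duplicate-fail still asserts.

impl SimpleChannel {
    fn get_update_fail_htlc(&mut self, htlc_id: HtlcId) -> Option<HtlcId> {
        if self.pending_inbound_htlcs.remove(&htlc_id) {
            // Normal case: the HTLC was still pending, so generate a fail for it.
            Some(htlc_id)
        } else {
            // The HTLC is already gone. A fail racing an earlier fulfill (the
            // success-then-failure sequence discussed above) is tolerated;
            // duplicate-failing is a bug because fails wait for the
            // commitment_signed dance to complete.
            debug_assert!(self.historical_inbound_htlc_fulfills.contains(&htlc_id));
            None
        }
    }
}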

// can cause duplicative claims if a node sends an update_fulfill_htlc message, disconnects,
// reconnects, and then has to re-send its update_fulfill_htlc message again.
// In previous code, we didn't handle the double-claim correctly, spuriously closing the
// channel on which the inbound HTLC was received.

Are you saying that prior to this PR, if we did receive a duplicate update_fulfill_htlc on the forward channel, we would have closed the backward one? Are you referring to the current get_update_fulfill_htlc or another behavior? I don't get it here...

Collaborator Author

Are you saying that prior to this PR, if we did receive a duplicate update_fulfill_htlc on the forward channel, we would have closed the backward one?

Yes, that's correct. We'd debug_assert!(false, ... then close the channel when debug assertions were off.

fn test_reconnect_dup_htlc_claims() {
do_test_reconnect_dup_htlc_claims(HTLCStatusAtDupClaim::Received);
do_test_reconnect_dup_htlc_claims(HTLCStatusAtDupClaim::HoldingCell);
do_test_reconnect_dup_htlc_claims(HTLCStatusAtDupClaim::Cleared);

Just for clarity, the third case you're mentioning in the commit message isn't exercised here?

With debug_assertions enabled, we also check that the previous
close of the same HTLC was a fulfill, and that we are not moving
from an HTLC failure to an HTLC claim after it's too late.

I translate that as hitting the HTLCUpdateAwaitingACK::FailHTLC branch in get_update_fulfill_htlc?

Collaborator Author

Just for clarity, the third case you're mentioning in the commit message isn't exercised here?

Yes, I do not believe that is possible to hit, so I certainly hope we aren't exercising it :).

I translate that as hitting the HTLCUpdateAwaitingACK::FailHTLC branch in get_update_fulfill_htlc?

Yes, roughly.


Agreed, we shouldn't be able to hit it if we duly wait for the commitment_signed dance on the forward channel before passing the update_fail_htlc backward.


ariard commented Jul 14, 2021

Code Review ACK b64ab91 modulo #977 (comment)

@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from b64ab91 to 40b7615 Compare July 14, 2021 18:23
@TheBlueMatt
Collaborator Author

Pushed additional commits to simplify the call graph further and handle a TODO that's been in the code for some time.

@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from f4c2f80 to 9861d16 Compare July 15, 2021 22:45
@valentinewallace valentinewallace self-requested a review July 21, 2021 20:22
Comment on lines 1360 to 1363
(Some(_), None) => {
// If we generated a claim message, we absolutely should have generated a
// ChannelMonitorUpdate, otherwise we are going to probably lose funds.
unreachable!();
Contributor

Would it be better for the compiler to enforce this by making the return value an enum? I think that would also allow us to get rid of the unwrap on line 2470 in the free_holding_cell method.

Collaborator Author

I don't think it drops the unwrap, sadly, but yeah, you're right, I simplified it a bit more.

Contributor

I think the new return value of Option<(Option<msgs::UpdateFulfillHTLC>, ChannelMonitorUpdate)> isn't too easy on the eyes. Think we could add an abstraction, like an enum or struct, to encapsulate it?

Collaborator Author

Done, I made it into a little enum.
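
For illustration, a sketch of what such an enum return might look like. The later diff hunk in this PR names the real enum UpdateFulfillCommitFetch with a DuplicateClaim {} variant; the exact fields of NewClaim below are assumptions, and the stub types only stand in for the real lightning structs. By construction, a claim always carries its monitor update, so the impossible "message without monitor update" combination from the unreachable!() above can no longer be expressed.

// Stub types standing in for the real structs, just so the sketch is self-contained.
struct ChannelMonitorUpdate;
struct UpdateFulfillHTLC;

enum UpdateFulfillCommitFetch {
    // A new claim: a monitor update is always produced, while the fulfill message may
    // still be held back (e.g. sitting in the holding cell), hence the Option.
    NewClaim { monitor_update: ChannelMonitorUpdate, msg: Option<UpdateFulfillHTLC> },
    // The fulfill is duplicative and already existed in the holding cell, or the HTLC
    // has been forgotten (presumably previously claimed); nothing to do.
    DuplicateClaim {},
}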

@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from 9861d16 to dac9f3f Compare July 22, 2021 21:21
@ariard left a comment

Code Review ACK dac9f3f

return Err(None)
Err((e, monitor_update)) => {
if let Err(e) = self.chain_monitor.update_channel(chan.get().get_funding_txo().unwrap(), monitor_update) {
log_error!(self.logger, "Critical error: failed to update channel monitor with preimage {:?}: {:?}",

Yeah, I agree we're pretty stuck in this catastrophic scenario... if the error source was tied to new-state key derivation we can still hope for the counterparty to broadcast its own commitment and claim the offered output on it, though that also assumes we still have the key for this htlcpubkey... a lot of ifs!

One upgrade to your KeysInterface could also be to inform the user when a signing abort has detrimental consequences for funds safety, though I expect the claim-back to be automated for routing nodes.

@valentinewallace (Contributor) left a comment

This PR is shaping up nicely! Just have a few more nits. I think the return values on get_update_fulfill and others are a bit complicated to parse, though that seems to be mostly an existing issue. Not sure if we want to spend the time this release to make them nicer.

return Err(None)
Err((e, monitor_update)) => {
if let Err(e) = self.chain_monitor.update_channel(chan.get().get_funding_txo().unwrap(), monitor_update) {
log_error!(self.logger, "Critical error: failed to update channel monitor with preimage {:?}: {:?}",
Contributor

Isn't the above update_channel failure case also a critical error? Just trying to figure out why the logic on update_channel failure isn't symmetric...

Collaborator Author

You're right, I'm still a bit stuck in the old mental model where we'd retry ChannelMonitorUpdates for users, but we no longer do that, so it's not a "critical" error as long as users can get the update to the monitors. I updated the text to be clearer and logged in both places (with appropriate log levels).


What do you mean here by getting the update to the monitors? If I'm following correctly, we might have an HTLC claimed forward with the preimage passed as an arg to claim_funds_from_hop, but we won't have been able to generate an updated commitment nor to pass the preimage to our onchain backend. If the counterparty force-closes the channel, our ChannelMonitor won't claim the offered HTLC it's owed?

Collaborator Author

Correct, kinda. The docs say you must still deliver the ChannelMonitorUpdate to some (probably in-memory) ChannelMonitor. The PermanentErr case is only for failure to update some ChannelMonitors, never all of them. There isn't anything better we can do in any case; we've given the update to the user, and they have to deliver it somewhere because we can't do anything further.

@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch 4 times, most recently from 829c3d1 to b35d0e2 Compare July 27, 2021 00:06
@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from b35d0e2 to 65fcf57 Compare July 27, 2021 18:59
@ariard left a comment

Code Review ACK 65fcf57, though let me know about the severity of this: #977 (comment), not sure we're all aligned.

Previously, we could fail to generate a new commitment transaction
but it simply indicated we had gone to double-claim an HTLC. Now
that double-claims are returned instead as Ok(None), we should
handle the error case and fail the channel, as the only way to hit
the error case is if key derivation failed or the user refused to
sign the new commitment transaction.

This also resolves an issue where we wouldn't inform our
ChannelMonitor of the new payment preimage in case we failed to
fetch a signature for the new commitment transaction.
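
A minimal sketch of the behavior the commit message above describes (hypothetical names, not the LDK API): the payment preimage reaches the ChannelMonitor even when fetching a signature for the new commitment fails, and only then is the channel failed.

// Hypothetical simplified types; the real code threads a ChannelMonitorUpdate through
// rather than pushing preimages directly.
struct Monitor { preimages: Vec<[u8; 32]> }

enum CommitResult { Signed(Vec<u8>), ChannelFailed }

fn claim_and_commit(
    monitor: &mut Monitor,
    preimage: [u8; 32],
    sign_commitment: impl FnOnce() -> Result<Vec<u8>, ()>,
) -> CommitResult {
    // Inform the monitor of the preimage first, regardless of whether signing succeeds,
    // so the HTLC can still be claimed on chain.
    monitor.preimages.push(preimage);
    match sign_commitment() {
        Ok(sig) => CommitResult::Signed(sig),
        // Key derivation failed or the signer refused to sign: with double-claims now
        // returned as a separate "duplicate" result, the only thing left to do here is
        // fail the channel.
        Err(()) => CommitResult::ChannelFailed,
    }
}
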
@TheBlueMatt TheBlueMatt force-pushed the 2021-06-fix-double-claim-close branch from 65fcf57 to f06f9d1 Compare July 28, 2021 00:35
@TheBlueMatt
Collaborator Author

TheBlueMatt commented Jul 28, 2021

Squashed fixups without diff. Diff from Val's ack is below, will merge after CI.

$ git diff-tree -U1  b35d0e2 f06f9d11
diff --git a/lightning/src/ln/channel.rs b/lightning/src/ln/channel.rs
index 0a847a8a..5a0ceabe 100644
--- a/lightning/src/ln/channel.rs
+++ b/lightning/src/ln/channel.rs
@@ -325,4 +325,4 @@ pub enum UpdateFulfillCommitFetch {
 	},
-	/// Indicates the HTLC fulfill is duplicative, and already existed either in the holding cell,
-	/// or the HTLC is forgotten completely (presumably previously claimed).
+	/// Indicates the HTLC fulfill is duplicative and already existed either in the holding cell
+	/// or has been forgotten (presumably previously claimed).
 	DuplicateClaim {},
@@ -2509,3 +2509,3 @@ impl<Signer: Sign> Channel<Signer> {
 								// If an HTLC failure was previously added to the holding cell (via
-								// `get_update_fail_htlc`, then generating the fail message itself
+								// `get_update_fail_htlc`) then generating the fail message itself
 								// must not fail - we should never end up in a state where we
$

@TheBlueMatt TheBlueMatt merged commit 1bb9e64 into lightningdevkit:main Jul 28, 2021