
Handle async initial ChannelMonitor persistence failing on restart #1678


Conversation

TheBlueMatt (Collaborator)

Based on #1106, this is the second of many steps towards making the async monitor persistence feature more robust, eventually leading to 0.1 and a safe async KV store API.

If the initial ChannelMonitor persistence is done asynchronously
but does not complete before the node restarts (with a
ChannelManager persistence), we'll start back up with a channel
present but no corresponding ChannelMonitor.

Because the Channel is pending-monitor-update and has not yet
broadcasted its initial funding transaction, this is not a
violation of our API contract nor a safety violation. However, the
previous code would refuse to deserialize the ChannelManager
treating it as an API contract violation.

The solution is to test for this case explicitly and drop the
channel entirely as if the peer disconnected before we received
the funding_signed.
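
To make the described behavior concrete, here is a minimal, self-contained sketch of the decision made when deserializing the ChannelManager. This is not LDK's actual code; the type and field names are hypothetical stand-ins.

// Hedged sketch, not LDK's real deserialization path; `ChannelStub` and its
// fields stand in for the real Channel/ChannelMonitor state.
struct ChannelStub {
    funding_broadcast: bool, // have we broadcast the funding transaction?
    monitor_present: bool,   // did we find a persisted ChannelMonitor on disk?
}

enum LoadAction {
    Keep,               // channel and monitor both present
    DropAsDisconnected, // pre-funding channel: forget it, as if the peer disconnected
    FailToDeserialize,  // monitor missing after funding: refuse to load
}

fn classify(chan: &ChannelStub) -> LoadAction {
    if chan.monitor_present {
        LoadAction::Keep
    } else if !chan.funding_broadcast {
        // The initial monitor persistence never completed, but we also never
        // broadcast the funding transaction, so dropping the channel is
        // equivalent to the peer disconnecting before funding_signed.
        LoadAction::DropAsDisconnected
    } else {
        // Funds may be on chain with no monitor watching them; this remains an error.
        LoadAction::FailToDeserialize
    }
}

fn main() {
    let pending = ChannelStub { funding_broadcast: false, monitor_present: false };
    assert!(matches!(classify(&pending), LoadAction::DropAsDisconnected));
}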

TheBlueMatt added this to the 0.0.111 milestone on Aug 20, 2022
TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from 9ad3838 to ba2a2ba on August 24, 2022 01:45
codecov-commenter commented Aug 24, 2022

Codecov Report

Base: 90.73% // Head: 91.17% // Increases project coverage by +0.44% 🎉

Coverage data is based on head (763ed22) compared to base (7544030).
Patch coverage: 97.12% of modified lines in pull request are covered.

❗ Current head 763ed22 differs from pull request most recent head 958601f. Consider uploading reports for the commit 958601f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1678      +/-   ##
==========================================
+ Coverage   90.73%   91.17%   +0.44%     
==========================================
  Files          87       87              
  Lines       46713    49843    +3130     
  Branches    46713    49843    +3130     
==========================================
+ Hits        42383    45444    +3061     
- Misses       4330     4399      +69     
Impacted Files Coverage Δ
lightning/src/util/events.rs 38.13% <ø> (ø)
lightning/src/ln/chanmon_update_fail_tests.rs 97.67% <96.96%> (-0.06%) ⬇️
lightning/src/ln/channel.rs 88.67% <97.22%> (+0.01%) ⬆️
lightning/src/ln/channelmanager.rs 85.18% <100.00%> (+0.02%) ⬆️
lightning-net-tokio/src/lib.rs 76.73% <0.00%> (-0.31%) ⬇️
lightning/src/ln/functional_tests.rs 96.93% <0.00%> (-0.12%) ⬇️
lightning/src/ln/monitor_tests.rs 99.44% <0.00%> (-0.12%) ⬇️
lightning/src/ln/msgs.rs 86.41% <0.00%> (+0.17%) ⬆️
lightning/src/routing/scoring.rs 96.34% <0.00%> (+0.20%) ⬆️
lightning/src/ln/chan_utils.rs 95.32% <0.00%> (+0.76%) ⬆️
... and 7 more


ariard left a comment

More parsing to be done to check the accuracy of the code and documentation. Nice documentation effort towards users!

/// secret we gave them that they shouldn't know).
///
/// Broadcasting these transactions in the second case is UNSAFE, as they allow counterparty
/// side to punish you. Nevertheless you may want to broadcast them if counterparty doesn't

Good comment, you might even add "YOU WILL LOSE THE ENTIRETY OF YOUR CHANNEL BALANCE" to make it even clearer that punishment entails losing the whole channel funds. It can be hard for an uninformed user to estimate the scope of the punishment at first sight.

Collaborator Author

This is #1106, see discussion there. Ultimately the issue is the docs were rewritten there assuming we're sticking with the "make sure you always store the first monitor locally and update it always" model, but we really want to drop that model (which this PR is the first step towards), so we're gonna have to rewrite those docs again soon :(.

/// or a remote copy of this [`ChannelMonitor`] is no longer reachable and thus not updatable).
///
/// When this is returned, [`ChannelManager`] will force-close the channel but *not* broadcast
/// our current commitment transaction. This avoids a dangerous case where a local disk failure

I think in the future a ChannelManager could still decide to force-close the channel by broadcasting the current commitment transaction if there is an unrelated or recoverable disk failure. Say you have a chain::Watch implementation coordinating a set of monitor replicas (as documented in our glossary). If one of those monitors' Persist implementations is affected by a disk failure, the error will be reported to the coordinator. If the number of monitor replicas affected by a permanent failure rises above some security or safety threshold, the coordinator could tell the ChannelManager to force-close channels preventively (e.g. to avoid growing the in-flight HTLC exposure). It should be safe to do so, as the latest state available in the remaining monitor replicas should be equivalent to the latest off-chain one.

Just to say, it's good to have this level of documentation at the interface level; however, it only documents one deployment model. As we refine these interfaces to support fancier kinds of deployments, I think it would be good to gather this material somewhere more accessible.

Collaborator Author

Yea, I don't think that necessarily belongs in the interface directly, though. Users who wish to use a fancier model can manually broadcast at that point. We should, of course, support this, and document it, but I'm not sure the code needs to change. In any case, I basically want to rewrite all the monitor stuff soon-ish, soooo
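
For illustration only, a rough sketch of the replica-coordinator deployment discussed in this thread. None of these types exist in LDK; they are hypothetical stand-ins for a custom chain::Watch deployment.

// Hypothetical coordinator tracking a set of ChannelMonitor replicas and
// deciding when to preventively force-close (sketch, not an LDK API).
#[derive(Clone, Copy, PartialEq)]
enum ReplicaStatus { Healthy, PermanentFailure }

struct MonitorCoordinator {
    replicas: Vec<ReplicaStatus>,
    // Minimum number of healthy replicas required before we keep accepting
    // new off-chain state rather than closing channels preventively.
    min_healthy: usize,
}

impl MonitorCoordinator {
    fn should_preventively_force_close(&self) -> bool {
        let healthy = self.replicas.iter()
            .filter(|s| **s == ReplicaStatus::Healthy)
            .count();
        healthy < self.min_healthy
    }
}

fn main() {
    let coordinator = MonitorCoordinator {
        replicas: vec![
            ReplicaStatus::Healthy,
            ReplicaStatus::PermanentFailure,
            ReplicaStatus::PermanentFailure,
        ],
        min_healthy: 2,
    };
    // Two of three replicas have permanently failed, so the coordinator would
    // ask the ChannelManager to force-close rather than risk losing state.
    assert!(coordinator.should_preventively_force_close());
}
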

/// [`PermanentFailure`]: ChannelMonitorUpdateResult::PermanentFailure
/// [`UpdateInProgress`]: ChannelMonitorUpdateResult::UpdateInProgress
/// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager
UpdateInProgress,

I think you might note the trade-off brought by UpdateInProgress: if you allow the channel manager to keep moving on its own without state being fully persisted, you gain latency at the HTLC-relay level. However, you're increasing your risks w.r.t. state replication. This trade-off might be acceptable for some routing node operators if, in the future, scoring algorithms start to encompass latency.

Collaborator Author

Yea, probably worth mentioning in general, not specific to just the async stuff.
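
As a rough illustration of that trade-off (hypothetical types, not LDK's Persist trait): reporting an update as in-progress lets the node keep handling traffic while the write completes in the background, at the cost of the newest state living only in memory until the write is acknowledged.

use std::collections::VecDeque;

// Sketch of an asynchronous persister: `persist_update` returns immediately
// with InProgress, and the channel only resumes once `on_write_acked` fires.
enum PersistResult { Completed, InProgress }

struct AsyncPersister {
    pending: VecDeque<u64>, // monitor update ids whose writes are not yet durable
}

impl AsyncPersister {
    fn persist_update(&mut self, update_id: u64) -> PersistResult {
        // Hand the write to a background task and report it as in progress;
        // lower latency, but this state is lost if we crash before the ack.
        self.pending.push_back(update_id);
        PersistResult::InProgress
    }

    fn on_write_acked(&mut self, update_id: u64) {
        // Only once the write is durable may the corresponding channel resume.
        self.pending.retain(|id| *id != update_id);
    }
}

fn main() {
    let mut persister = AsyncPersister { pending: VecDeque::new() };
    assert!(matches!(persister.persist_update(1), PersistResult::InProgress));
    persister.on_write_acked(1);
    assert!(persister.pending.is_empty());
}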

ariard removed the Review club label on Sep 7, 2022
TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch 2 times, most recently from 1e90a4d to c7590e7, on September 29, 2022 22:11
TheBlueMatt (Collaborator Author)

Rebased after we landed #1106. Now the real work begins :)

TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from c7590e7 to c88eb35 on October 4, 2022 16:02

let persister: test_utils::TestPersister;
let new_chain_monitor: test_utils::TestChainMonitor;
let nodes_0_deserialized: ChannelManager<EnforcingSigner, &test_utils::TestChainMonitor, &test_utils::TestBroadcaster, &test_utils::TestKeysInterface, &test_utils::TestFeeEstimator, &test_utils::TestLogger>;
Contributor

Would be nice to make a TestChannelManager type alias for this; we have this same type declaration in several places.

Collaborator Author

Let's do it as a part of #1696?
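
For reference, such an alias might look roughly like this in the test utilities (the alias name, lifetimes, and location are assumptions; the actual cleanup was deferred to #1696):

pub type TestChannelManager<'a> = ChannelManager<
    EnforcingSigner,
    &'a test_utils::TestChainMonitor,
    &'a test_utils::TestBroadcaster,
    &'a test_utils::TestKeysInterface,
    &'a test_utils::TestFeeEstimator,
    &'a test_utils::TestLogger,
>;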

nodes[0].chain_source.watched_txn.lock().unwrap().clear();
nodes[0].chain_source.watched_outputs.lock().unwrap().clear();

let nodes_0_serialized = nodes[0].node.encode();
Contributor

Can we extract the ChannelManager reloading into a macro/function? This seems like something super useful to use across tests and would reduce a good bit of code.

Collaborator Author (TheBlueMatt, Oct 5, 2022)

Yea, #1696. I'm not really excited about doing it here - this PR is a part of the bigger 0.1 series which is nontrivial and needs to keep moving.

nodes[0].node = &nodes_0_deserialized;
assert!(nodes_0_read.is_empty());

check_closed_event!(nodes[0], 1, ClosureReason::DisconnectedPeer);
Contributor

Let's check that the counterparty also forgets the channel here and in the other test.

Collaborator Author

It doesn't forget the channel, though. Or, at least, it only will if it connects and gets the error-unknown channel message. Is it worth adding a specific test for that here? (I did add a check-list_channels-is-empty check).

Contributor

I guess it's not required for the sake of what we're testing, but we could either do that or connect blocks until the 2016 block deadline is reached.
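
For context, the check mentioned above, and the deadline alternative, might look roughly like the following in the test. This is a sketch assuming the usual functional-test helpers; the exact closure event on the counterparty is not asserted here.

// The reloaded node should no longer list the channel at all:
assert!(nodes[0].node.list_channels().is_empty());

// Alternatively (sketch), make the counterparty forget the never-funded
// channel by connecting blocks past the 2016-block funding deadline:
connect_blocks(&nodes[1], 2016);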

wpaulino (Contributor) left a comment

LGTM, feel free to squash.

TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from e06229a to 37e5458 on October 5, 2022 20:00
TheBlueMatt (Collaborator Author)

Squashed without further changes from e06229a1 to 37e54584

wpaulino previously approved these changes Oct 5, 2022
}
if self.cur_holder_commitment_transaction_number == INITIAL_COMMITMENT_NUMBER - 1 &&
self.cur_counterparty_commitment_transaction_number == INITIAL_COMMITMENT_NUMBER - 1 {
// If we're a 0-conf channel, and our commitment transaction numbers have both been
Contributor

Not sure I understand why 0-conf is relevant here. Is the first paragraph of the comment describing how we would get inside the enclosing if? (Would prefer that the comment precede the if if that is the case.) And the second paragraph describing the assertion in the if immediately below?

Collaborator Author

Hmm, I tend to always include comments describing how we got into a scope in the scope itself, otherwise the flow is hard to read due to everything being at the same indentation. I rewrote the comment a bit to split up the "why we're here" and the "what we're asserting".

}

fn do_test_inbound_reload_without_init_mon(use_0conf: bool, lock_commitment: bool) {
// Test that if the monitor update generated by funding_generated is stored async and we
Contributor

funding_created?

Collaborator Author

Was referring to the ChannelManager method; clarified.

do_test_outbound_reload_without_init_mon(false);
}

fn do_test_inbound_reload_without_init_mon(use_0conf: bool, lock_commitment: bool) {
Contributor

Can much of this be DRY'ed up with do_test_outbound_reload_without_init_mon if node[0] and node[1] were swapped? Seems like the diff between the two functions is mostly but not entirely index changes. Not sure if a combined function would be more or less readable with an is_outbound param.

Collaborator Author

Hmm, so the first few hunks setting up the channel are the same, but then they diverge a good bit before doing a similar reload block. I guess we could DRY some of the initial stuff, but I'd really like to just DRY up all the reload stuff spewed across all our tests at once, see #1696. I'm not really sure it's worth DRYing up the three or four hunks that are the same across the tests, either.

jkczyz previously approved these changes Oct 18, 2022
wpaulino (Contributor)

Ready to merge after squash @TheBlueMatt.

If the initial ChannelMonitor persistence is done asynchronously
but does not complete before the node restarts (with a
ChannelManager persistence), we'll start back up with a channel
present but no corresponding ChannelMonitor.

Because the Channel is pending-monitor-update and has not yet
broadcasted its initial funding transaction or sent channel_ready,
this is not a violation of our API contract nor a safety violation.
However, the previous code would refuse to deserialize the
ChannelManager treating it as an API contract violation.

The solution is to test for this case explicitly and drop the
channel entirely as if the peer disconnected before we received
the funding_signed for outbound channels or before sending the
the channel_ready for inbound channels.

As we're moving towards monitor update async being a supported
use-case, we shouldn't call an async monitor update "failed", but
rather "in progress". This simply updates the internal channel.rs
enum name to reflect the new thinking.
TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from 763ed22 to 958601f on October 19, 2022 14:41
TheBlueMatt (Collaborator Author)

Squashed without changes:

$ git diff-tree -U1 763ed22f 958601f1
$

TheBlueMatt merged commit 89747dc into lightningdevkit:main on Oct 19, 2022