
Handle async initial ChannelMonitor persistence failing on restart #1678


Conversation

TheBlueMatt (Collaborator)

Based on #1106, this is the second of many steps towards making the async monitor persistence feature more robust, eventually leading to 0.1 and a safe async KV store API.

If the initial ChannelMonitor persistence is done asynchronously
but does not complete before the node restarts (with a
ChannelManager persistence), we'll start back up with a channel
present but no corresponding ChannelMonitor.

Because the Channel is pending-monitor-update and has not yet
broadcasted its initial funding transaction, this is not a
violation of our API contract nor a safety violation. However, the
previous code would refuse to deserialize the ChannelManager
treating it as an API contract violation.

The solution is to test for this case explicitly and drop the
channel entirely as if the peer disconnected before we received
the funding_signed.
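
To make the described behavior concrete, here is a minimal, self-contained sketch of the decision made when deserializing the ChannelManager. This is not LDK's actual code; the type and field names are hypothetical stand-ins.

// Hedged sketch, not LDK's real deserialization path; `ChannelStub` and its
// fields stand in for the real Channel/ChannelMonitor state.
struct ChannelStub {
    funding_broadcast: bool, // have we broadcast the funding transaction?
    monitor_present: bool,   // did we find a persisted ChannelMonitor on disk?
}

enum LoadAction {
    Keep,               // channel and monitor both present
    DropAsDisconnected, // pre-funding channel: forget it, as if the peer disconnected
    FailToDeserialize,  // monitor missing after funding: refuse to load
}

fn classify(chan: &ChannelStub) -> LoadAction {
    if chan.monitor_present {
        LoadAction::Keep
    } else if !chan.funding_broadcast {
        // The initial monitor persistence never completed, but we also never
        // broadcast the funding transaction, so dropping the channel is
        // equivalent to the peer disconnecting before funding_signed.
        LoadAction::DropAsDisconnected
    } else {
        // Funds may be on chain with no monitor watching them; this remains an error.
        LoadAction::FailToDeserialize
    }
}

fn main() {
    let pending = ChannelStub { funding_broadcast: false, monitor_present: false };
    assert!(matches!(classify(&pending), LoadAction::DropAsDisconnected));
}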

TheBlueMatt added this to the 0.0.111 milestone on Aug 20, 2022
TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from 9ad3838 to ba2a2ba on August 24, 2022 01:45
codecov-commenter commented Aug 24, 2022

Codecov Report

Base: 90.73% // Head: 91.17% // Increases project coverage by +0.44% 🎉

Coverage data is based on head (763ed22) compared to base (7544030).
Patch coverage: 97.12% of modified lines in pull request are covered.

❗ Current head 763ed22 differs from pull request most recent head 958601f. Consider uploading reports for the commit 958601f to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1678      +/-   ##
==========================================
+ Coverage   90.73%   91.17%   +0.44%     
==========================================
  Files          87       87              
  Lines       46713    49843    +3130     
  Branches    46713    49843    +3130     
==========================================
+ Hits        42383    45444    +3061     
- Misses       4330     4399      +69     
Impacted Files Coverage Δ
lightning/src/util/events.rs 38.13% <ø> (ø)
lightning/src/ln/chanmon_update_fail_tests.rs 97.67% <96.96%> (-0.06%) ⬇️
lightning/src/ln/channel.rs 88.67% <97.22%> (+0.01%) ⬆️
lightning/src/ln/channelmanager.rs 85.18% <100.00%> (+0.02%) ⬆️
lightning-net-tokio/src/lib.rs 76.73% <0.00%> (-0.31%) ⬇️
lightning/src/ln/functional_tests.rs 96.93% <0.00%> (-0.12%) ⬇️
lightning/src/ln/monitor_tests.rs 99.44% <0.00%> (-0.12%) ⬇️
lightning/src/ln/msgs.rs 86.41% <0.00%> (+0.17%) ⬆️
lightning/src/routing/scoring.rs 96.34% <0.00%> (+0.20%) ⬆️
lightning/src/ln/chan_utils.rs 95.32% <0.00%> (+0.76%) ⬆️
... and 7 more


ariard left a comment

More parsing to be done to check the accuracy of the code and documentation. Nice documentation effort towards users!

/// secret we gave them that they shouldn't know).
///
/// Broadcasting these transactions in the second case is UNSAFE, as they allow counterparty
/// side to punish you. Nevertheless you may want to broadcast them if counterparty doesn't

Good comment, you might even add "YOU WILL LOSE THE ENTIRETY OF YOUR CHANNEL BALANCE" to make it even clearer that punishment entails losing the whole channel funds. It can be hard for an uninformed user to estimate the scope of the punishment at first sight.

Collaborator Author

This is #1106, see discussion there. Ultimately the issue is the docs were rewritten there assuming we're sticking with the "make sure you always store the first monitor locally and update it always" model, but we really want to drop that model (which this PR is the first step towards), so we're gonna have to rewrite those docs again soon :(.

/// or a remote copy of this [`ChannelMonitor`] is no longer reachable and thus not updatable).
///
/// When this is returned, [`ChannelManager`] will force-close the channel but *not* broadcast
/// our current commitment transaction. This avoids a dangerous case where a local disk failure

I think in the future a ChannelManager could still decide to force-close the channel by broadcasting the current commitment transaction if there is an unrelated or recoverable disk failure. Say you have a chain::Watch implementation coordinating a set of monitor replicas (as documented in our glossary). If one of those monitors' Persist implementations is affected by a disk failure, the error will be reported to the coordinator. If the number of monitor replicas affected by a permanent failure rises above some security or safety threshold, the coordinator could tell the ChannelManager to force-close channels preventively (e.g. to avoid growing the in-flight HTLC exposure). It should be safe to do so, as the latest state available in the remaining monitor replicas should be equivalent to the latest off-chain one.

Just to say, it's good to have this level of documentation at the interface level; however, it only documents one deployment model. As we refine these interfaces to support fancier kinds of deployments, I think it would be good to gather this material somewhere more accessible.

Collaborator Author

Yea, I don't think that necessarily belongs in the interface directly, though. Users who wish to use a fancier model can manually broadcast at that point. We should, of course, support this, and document it, but I'm not sure the code needs to change. In any case, I basically want to rewrite all the monitor stuff soon-ish, soooo
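
For illustration only, a rough sketch of the replica-coordinator deployment discussed in this thread. None of these types exist in LDK; they are hypothetical stand-ins for a custom chain::Watch deployment.

// Hypothetical coordinator tracking a set of ChannelMonitor replicas and
// deciding when to preventively force-close (sketch, not an LDK API).
#[derive(Clone, Copy, PartialEq)]
enum ReplicaStatus { Healthy, PermanentFailure }

struct MonitorCoordinator {
    replicas: Vec<ReplicaStatus>,
    // Minimum number of healthy replicas required before we keep accepting
    // new off-chain state rather than closing channels preventively.
    min_healthy: usize,
}

impl MonitorCoordinator {
    fn should_preventively_force_close(&self) -> bool {
        let healthy = self.replicas.iter()
            .filter(|s| **s == ReplicaStatus::Healthy)
            .count();
        healthy < self.min_healthy
    }
}

fn main() {
    let coordinator = MonitorCoordinator {
        replicas: vec![
            ReplicaStatus::Healthy,
            ReplicaStatus::PermanentFailure,
            ReplicaStatus::PermanentFailure,
        ],
        min_healthy: 2,
    };
    // Two of three replicas have permanently failed, so the coordinator would
    // ask the ChannelManager to force-close rather than risk losing state.
    assert!(coordinator.should_preventively_force_close());
}
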

/// [`PermanentFailure`]: ChannelMonitorUpdateResult::PermanentFailure
/// [`UpdateInProgress`]: ChannelMonitorUpdateResult::UpdateInProgress
/// [`ChannelManager`]: crate::ln::channelmanager::ChannelManager
UpdateInProgress,

I think you might note the trade-off brought by UpdateInProgress: if you allow the channel manager to keep moving on its own without state being fully persisted, you gain latency at the HTLC-relay level. However, you're increasing your risks w.r.t. state replication. This trade-off might be acceptable for some routing node operators if, in the future, scoring algorithms start to encompass latency.

Collaborator Author

Yea, probably worth mentioning in general, not specific to just the async stuff.
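
As a rough illustration of that trade-off (hypothetical types, not LDK's Persist trait): reporting an update as in-progress lets the node keep handling traffic while the write completes in the background, at the cost of the newest state living only in memory until the write is acknowledged.

use std::collections::VecDeque;

// Sketch of an asynchronous persister: `persist_update` returns immediately
// with InProgress, and the channel only resumes once `on_write_acked` fires.
enum PersistResult { Completed, InProgress }

struct AsyncPersister {
    pending: VecDeque<u64>, // monitor update ids whose writes are not yet durable
}

impl AsyncPersister {
    fn persist_update(&mut self, update_id: u64) -> PersistResult {
        // Hand the write to a background task and report it as in progress;
        // lower latency, but this state is lost if we crash before the ack.
        self.pending.push_back(update_id);
        PersistResult::InProgress
    }

    fn on_write_acked(&mut self, update_id: u64) {
        // Only once the write is durable may the corresponding channel resume.
        self.pending.retain(|id| *id != update_id);
    }
}

fn main() {
    let mut persister = AsyncPersister { pending: VecDeque::new() };
    assert!(matches!(persister.persist_update(1), PersistResult::InProgress));
    persister.on_write_acked(1);
    assert!(persister.pending.is_empty());
}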

ariard removed the Review club label on Sep 7, 2022
TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch 2 times, most recently from 1e90a4d to c7590e7, on September 29, 2022 22:11
TheBlueMatt (Collaborator Author)

Rebased after we landed #1106. Now the real work begins :)

TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from c7590e7 to c88eb35 on October 4, 2022 16:02

let persister: test_utils::TestPersister;
let new_chain_monitor: test_utils::TestChainMonitor;
let nodes_0_deserialized: ChannelManager<EnforcingSigner, &test_utils::TestChainMonitor, &test_utils::TestBroadcaster, &test_utils::TestKeysInterface, &test_utils::TestFeeEstimator, &test_utils::TestLogger>;
Contributor

Would be nice to make a TestChannelManager type alias for this; we have this same type declaration in several places.

Collaborator Author

Let's do it as a part of #1696?
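
For reference, such an alias might look roughly like this in the test utilities (the alias name, lifetimes, and location are assumptions; the actual cleanup was deferred to #1696):

pub type TestChannelManager<'a> = ChannelManager<
    EnforcingSigner,
    &'a test_utils::TestChainMonitor,
    &'a test_utils::TestBroadcaster,
    &'a test_utils::TestKeysInterface,
    &'a test_utils::TestFeeEstimator,
    &'a test_utils::TestLogger,
>;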

nodes[0].chain_source.watched_txn.lock().unwrap().clear();
nodes[0].chain_source.watched_outputs.lock().unwrap().clear();

let nodes_0_serialized = nodes[0].node.encode();
Contributor

Can we extract the ChannelManager reloading into a macro/function? This seems like something super useful to use across tests and would reduce a good bit of code.

Collaborator Author (TheBlueMatt, Oct 5, 2022)

Yea, #1696. I'm not really excited about doing it here - this PR is a part of the bigger 0.1 series which is nontrivial and needs to keep moving.

nodes[0].node = &nodes_0_deserialized;
assert!(nodes_0_read.is_empty());

check_closed_event!(nodes[0], 1, ClosureReason::DisconnectedPeer);
Contributor

Let's check that the counterparty also forgets the channel here and in the other test.

Collaborator Author

It doesn't forget the channel, though. Or, at least, it only will if it connects and gets the error-unknown channel message. Is it worth adding a specific test for that here? (I did add a check-list_channels-is-empty check).

Contributor

I guess it's not required for the sake of what we're testing, but we could either do that or connect blocks until the 2016 block deadline is reached.
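
For context, the check mentioned above, and the deadline alternative, might look roughly like the following in the test. This is a sketch assuming the usual functional-test helpers; the exact closure event on the counterparty is not asserted here.

// The reloaded node should no longer list the channel at all:
assert!(nodes[0].node.list_channels().is_empty());

// Alternatively (sketch), make the counterparty forget the never-funded
// channel by connecting blocks past the 2016-block funding deadline:
connect_blocks(&nodes[1], 2016);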

wpaulino (Contributor) left a comment

LGTM, feel free to squash.

TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from e06229a to 37e5458 on October 5, 2022 20:00
TheBlueMatt (Collaborator Author)

Squashed without further changes from e06229a1 to 37e54584

wpaulino previously approved these changes Oct 5, 2022
}
if self.cur_holder_commitment_transaction_number == INITIAL_COMMITMENT_NUMBER - 1 &&
self.cur_counterparty_commitment_transaction_number == INITIAL_COMMITMENT_NUMBER - 1 {
// If we're a 0-conf channel, and our commitment transaction numbers have both been
Contributor

Not sure I understand why 0-conf is relevant here. Is the first paragraph of the comment describing how we would get inside the enclosing if? (Would prefer that the comment precede the if if that is the case.) And the second paragraph describing the assertion in the if immediately below?

Collaborator Author

Hmm, I tend to always include comments describing how we got into a scope in the scope itself, otherwise the flow is hard to read due to everything being at the same indentation. I rewrote the comment a bit to split up the "why we're here" and the "what we're asserting".

}

fn do_test_inbound_reload_without_init_mon(use_0conf: bool, lock_commitment: bool) {
// Test that if the monitor update generated by funding_generated is stored async and we
Contributor

funding_created?

Collaborator Author

Was referring to the ChannelManager method; clarified.

do_test_outbound_reload_without_init_mon(false);
}

fn do_test_inbound_reload_without_init_mon(use_0conf: bool, lock_commitment: bool) {
Contributor

Can much of this be DRY'ed up with do_test_outbound_reload_without_init_mon if node[0] and node[1] were swapped? Seems like the diff between the two functions is mostly but not entirely index changes. Not sure if a combined function would be more or less readable with an is_outbound param.

Collaborator Author

Hmm, so the first few hunks setting up the channel are the same, but then they diverge a good bit before doing a similar reload block. I guess we could DRY some of the initial stuff, but I'd really like to just DRY up all the reload stuff spewed across all our tests at once, see #1696. I'm not really sure it's worth DRYing up the three or four hunks that are the same across the tests, either.

jkczyz previously approved these changes Oct 18, 2022
wpaulino (Contributor)

Ready to merge after squash @TheBlueMatt.

If the initial ChannelMonitor persistence is done asynchronously
but does not complete before the node restarts (with a
ChannelManager persistence), we'll start back up with a channel
present but no corresponding ChannelMonitor.

Because the Channel is pending-monitor-update and has not yet
broadcasted its initial funding transaction or sent channel_ready,
this is not a violation of our API contract nor a safety violation.
However, the previous code would refuse to deserialize the
ChannelManager treating it as an API contract violation.

The solution is to test for this case explicitly and drop the
channel entirely as if the peer disconnected before we received
the funding_signed for outbound channels or before sending the
the channel_ready for inbound channels.

As we're moving towards monitor update async being a supported
use-case, we shouldn't call an async monitor update "failed", but
rather "in progress". This simply updates the internal channel.rs
enum name to reflect the new thinking.
TheBlueMatt force-pushed the 2022-08-funding-locked-mon-persist-fail branch from 763ed22 to 958601f on October 19, 2022 14:41
TheBlueMatt (Collaborator Author)

Squashed without changes:

$ git diff-tree -U1 763ed22f 958601f1
$

TheBlueMatt merged commit 89747dc into lightningdevkit:main on Oct 19, 2022