
(3/4) Conduit fixups from 494 #693


Closed

Conversation

@julianknutsen julianknutsen commented Sep 14, 2020

PRs:
(1/4) Enum Dispatch NOISE State Machine & Unit Tests
(2/4) PeerManager Unit Tests and Refactoring
(3/4) Conduit fixups from 494
(4/4) Update OutboundQueue flush behavior and fix existing bugs that it found

History

  • Update 10-5-2020: Code handoff to @Arik and @ariard for next steps

Cleans up the remaining feedback regarding the Conduit behavior in the original #494 review.

New commits start at 3ab11f2; this is a continuation of (2/4).

  • Decryption has been moved to the read() path to remove the complexity required to handle iterators that can error
  • The Decryptor will now decrypt directly from the input slice if there is no existing data in the read buffer, avoiding a copy into the read buffer and improving performance in that case.
  • The Conduit object has been destructured into Encryptor & Decryptor.
  • process_act() now returns a simple container structure for the Encryptor/Decryptor/Remote Static Public Key to clean up the return value.
/// Container for the information returned from a successfully completed handshake
pub struct CompletedHandshakeInfo {
	pub decryptor: Decryptor,
	pub encryptor: Encryptor,
	pub their_node_id: PublicKey,
}
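A minimal caller sketch of how the struct above might be consumed; the stand-in Decryptor/Encryptor definitions and the handle_complete() helper are hypothetical, only CompletedHandshakeInfo mirrors the struct shown here.

use bitcoin::secp256k1::PublicKey;

pub struct Decryptor;
pub struct Encryptor;

pub struct CompletedHandshakeInfo {
    pub decryptor: Decryptor,
    pub encryptor: Encryptor,
    pub their_node_id: PublicKey,
}

// On a completed handshake, split the halves: the write path keeps the Encryptor,
// the read path keeps the Decryptor, and their_node_id identifies the peer.
fn handle_complete(info: CompletedHandshakeInfo) -> (Encryptor, Decryptor, PublicKey) {
    (info.encryptor, info.decryptor, info.their_node_id)
}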

Bug Fixes

This also cleans up an invalid panic!() that would cause the library to crash if it received a read_event() payload larger than a single encrypted Lightning message. This is valid input, as a single read_event() can span multiple encrypted messages.

fix: Don't panic during valid decryption states

@julianknutsen julianknutsen marked this pull request as ready for review September 15, 2020 00:31
@julianknutsen julianknutsen changed the title (3/4) Conduit fixups from 494 (3/3) Conduit fixups from 494 Sep 15, 2020
/// Maximum Lightning message data length according to
/// [BOLT-8](https://github.com/lightningnetwork/lightning-rfc/blob/v1.0/08-transport.md#lightning-message-specification)
/// and [BOLT-1](https://github.com/lightningnetwork/lightning-rfc/blob/master/01-messaging.md#lightning-message-format):
const LN_MAX_MSG_LEN: usize = ::std::u16::MAX as usize; // Must be equal to 65535
Author

So, I went back to try to understand where this came from, and I'm pretty sure this is a bug.

https://github.com/lightningnetwork/lightning-rfc/blob/v1.0/08-transport.md#lightning-message-specification

The BOLT defines the maximum Lightning message length as 65535 bytes, but the maximum encrypted packet size is 65569 bytes once the 2-byte length prefix and the two 16-byte MAC tags are included (65535 + 2 + 16 + 16 = 65569). This fails in some weird ways, and I'm working on a follow-up PR to fix the bug and make it cleaner.
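For reference, a hedged sketch of the arithmetic (LN_MAX_PACKET_LENGTH is the name used later in this stack; the helper constants here are illustrative):

const LN_MAX_MSG_LEN: usize = 65535; // maximum payload, fits the 2-byte length prefix
const LEN_PREFIX_LEN: usize = 2;     // encrypted length prefix
const TAG_LEN: usize = 16;           // Poly1305 tag on the length prefix and on the payload
// Maximum size of a single encrypted message on the wire: 2 + 16 + 65535 + 16 = 65569.
const LN_MAX_PACKET_LENGTH: usize = LEN_PREFIX_LEN + TAG_LEN + LN_MAX_MSG_LEN + TAG_LEN;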

Author

Done: fix: Don't panic during valid decryption states

@codecov

codecov bot commented Sep 15, 2020

Codecov Report

Merging #693 into main will increase coverage by 1.94%.
The diff coverage is 91.19%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #693      +/-   ##
==========================================
+ Coverage   92.01%   93.95%   +1.94%     
==========================================
  Files          37       45       +8     
  Lines       20106    23036    +2930     
==========================================
+ Hits        18500    21643    +3143     
+ Misses       1606     1393     -213     
Impacted Files Coverage Δ
lightning/src/ln/peers/handler.rs 97.43% <ø> (ø)
lightning/src/ln/wire.rs 89.71% <ø> (+21.71%) ⬆️
lightning/src/ln/features.rs 88.99% <50.00%> (-0.38%) ⬇️
lightning/src/util/test_utils.rs 75.44% <61.72%> (-9.54%) ⬇️
lightning/src/ln/peers/test_util.rs 89.10% <89.10%> (ø)
lightning/src/ln/peers/transport.rs 92.30% <92.30%> (ø)
lightning/src/ln/peers/handshake/acts.rs 94.64% <94.64%> (ø)
lightning/src/ln/peers/encryption.rs 95.12% <95.12%> (ø)
lightning/src/ln/peers/chacha.rs 95.65% <95.65%> (ø)
lightning/src/ln/peers/handshake/mod.rs 95.91% <95.91%> (ø)
... and 20 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@julianknutsen julianknutsen force-pushed the conduit-fixups branch 4 times, most recently from da57106 to 5c4cda7 Compare September 18, 2020 01:54
@julianknutsen julianknutsen changed the title (3/3) Conduit fixups from 494 (3/4) Conduit fixups from 494 Sep 18, 2020
@julianknutsen
Author

Rebased against upstream.

julianknutsen and others added 14 commits October 2, 2020 12:58
This will make future changes that stabilize this work easier to review
and merge.

Co-authored-by: Arik Sosman <[email protected]>
Co-authored-by: Julian Knutsen <[email protected]>
Lots of build warnings still, but it builds and tests pass.
Add unit tests against the PeerHandshake public interface prior to
changing any of the implementation details.

This is a solid stand-alone patch, even if the subsequent refactor and
design changes are not merged. A future patch uses the test vectors from
the RFC to minimize the code under test.
The implementation logic is duplicated from handshake/mod.rs so that it
can be tested and refactored independent of the current implementation.

This uses enum dispatch coupled with separately 'moved' state objects
that allow for simple testing, easy state inspection, and understandable
transition logic between states. This pattern also removes the need for
 match statements inside implementation logic in favor of the instance
representing the current state and available data.

Every state transition is implemented in a next() function and is a
straightforward translation from an object representing the existing
known data and an input act message (bytes).

Example usage:

let (act_data, next_state) = cur_state.next(input_data);

This state machine will eventually be moved into place inside the
PeerHandshake which will act as a simple wrapper that just maintains the
state machine and marshals the data between the peer_handler and states.
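A minimal sketch of that pattern; the state names, payload sizes, and error type are illustrative and not the actual handshake states in this PR.

struct AwaitingActOne { /* keys known before act one */ }
struct AwaitingActThree { /* handshake hash, chaining key, ... */ }
struct Complete { /* derived send/receive keys */ }

enum HandshakeState {
    AwaitingActOne(AwaitingActOne),
    AwaitingActThree(AwaitingActThree),
    Complete(Complete),
}

impl AwaitingActOne {
    // Consume act one, produce the act-two bytes, and move to the next state.
    fn next(self, _act: &[u8]) -> Result<(Vec<u8>, HandshakeState), &'static str> {
        Ok((vec![0u8; 50], HandshakeState::AwaitingActThree(AwaitingActThree {})))
    }
}

impl AwaitingActThree {
    // Consume act three; no response bytes, the handshake is complete.
    fn next(self, _act: &[u8]) -> Result<(Vec<u8>, HandshakeState), &'static str> {
        Ok((Vec::new(), HandshakeState::Complete(Complete {})))
    }
}

impl HandshakeState {
    // The wrapper's only job: dispatch to the current state, moving it out in the process.
    fn next(self, act: &[u8]) -> Result<(Vec<u8>, HandshakeState), &'static str> {
        match self {
            HandshakeState::AwaitingActOne(s) => s.next(act),
            HandshakeState::AwaitingActThree(s) => s.next(act),
            HandshakeState::Complete(_) => Err("no acts expected after completion"),
        }
    }
}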
PeerHandshake is now a thin wrapper around the new state machine that
simply marshals data between the callers and the state machine states. For
now, keep the APIs identical to make the patches easier to read.

The only interesting piece to note here is the handling of
get_remote_pubkey() to keep parity with the existing implementation.
Future patches will remove this separate function in favor of callers
taking ownership of it once the handshake is complete.
Introduce the sha256!() and concat!() macros that are used to
improve readability of the noise exchanges.
Small inconvenience with the HandshakeHash return value that will be
taken care of by the end of this patch stack.
Fix inconsistent use of concat vs. Sha256::engine() input. These can
be deduplicated at the end.
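A hedged sketch of what such a readability macro could look like; the actual macro bodies in the PR may differ, and this assumes the bitcoin hashes API used elsewhere in the crate.

use bitcoin::hashes::{sha256::Hash as Sha256, Hash, HashEngine};

// Feed any number of byte-slice-like arguments into a single SHA-256 engine.
macro_rules! sha256 {
    ( $( $input:expr ),+ ) => {{
        let mut engine = Sha256::engine();
        $( engine.input($input.as_ref()); )+
        Sha256::from_engine(engine)
    }}
}

// Usage: hash the concatenation of two inputs without building an intermediate Vec.
fn handshake_hash(prev_hash: &[u8; 32], data: &[u8]) -> Sha256 {
    sha256!(prev_hash, data)
}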
Identify potential issue regarding the invalid case where
the read_buffer has data after completion.
All needed public keys are calculated during the first
state and passed through as needed. Limits the call sites
of private_key_to_public and makes the initial states
for the initiator and responder more symmetric.
Use the newly abstracted Transport layer to write unit tests for most of
the PeerManager functionality.

This uses dependency inversion with the PeerManager to pass in a
Transport test double that is configured to exercise the interesting
code paths. To enable this, a new struct PeerManagerImpl has been
created that has all of the original functionality of PeerManager, but
includes a type parameter for the Transport implementation that should
be used.

To keep this test-feature hidden from public docs, the
PeerManager struct is now just a shim that delegates all the calls to
PeerManagerImpl. This provides a nice balance of a clean public
interface with the features needed to create isolated tests.

The tests make use of the Spy and Stub test patterns to control input state and
validate the output state.
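A hedged sketch of the shim-plus-impl layering; the PeerManager/PeerManagerImpl names follow the commit, while the Transport trait and method bodies here are simplified stand-ins.

trait Transport {
    fn process_input(&mut self, input: &[u8]) -> Result<usize, String>;
}

struct ProductionTransport;
impl Transport for ProductionTransport {
    fn process_input(&mut self, input: &[u8]) -> Result<usize, String> {
        Ok(input.len())
    }
}

// All of the real logic lives here; the Transport type parameter is what the unit
// tests replace with a Spy/Stub test double.
struct PeerManagerImpl<T: Transport> {
    transport: T,
}

impl<T: Transport> PeerManagerImpl<T> {
    fn read_event(&mut self, data: &[u8]) -> Result<usize, String> {
        self.transport.process_input(data)
    }
}

// Public shim with no test-only type parameter: it simply delegates.
pub struct PeerManager {
    inner: PeerManagerImpl<ProductionTransport>,
}

impl PeerManager {
    pub fn read_event(&mut self, data: &[u8]) -> Result<usize, String> {
        self.inner.read_event(data)
    }
}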
Pass in all the in/out parameters individually instead of taking
everything out of peer.

The important thing to notice here is that the pending_outbound_buffer
parameter MUST be an OutboundQueue and CANNOT be an 'impl PayloadQueuer'.
This demonstrates that the read_event() function has the ability to
flush() since it can use the SocketDescriptorFlusher interface.

This should be illegal due to the invariant that read_event() never
calls back into the SocketDescriptor.

This is one of the motivations for using trait bounds in argument
position as a defensive programming technique: it prohibits functions from
taking actions that shouldn't be allowed even if they have access to the
object.

Future patches will fix this bug and ensure the type system prevents it
from happening again.

This is a short-lived destructure that will be recombined once the
dependencies are cleaned up.
Per the external documentation, there is an invariant that read_event()
will not call back into the SocketDescriptor, but the actual code
flushed the outbound queue.

Fix the bug and use the type system to guarantee this behavior in
the future. By changing the function parameters to use `impl
PayloadQueuer`, the function only has access to the functions on that
interface throughout its execution, while callers can still pass in the
OutboundQueue object since it satisfies the trait bound.

Functions such as enqueue_message() also take an `impl PayloadQueuer` so
they can be called from that context, but functions that need the
SocketDescriptorFlusher interface will fail at compile-time and point to
the issue before any tests need to be run.

This also fixes up tests and the tokio implementation that did not
implement the proper behavior and relied on read_event() calling
send_data().
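A hedged sketch of the argument-position trait bound described here; the trait and type names follow the commits, but the method signatures and bodies are illustrative.

// The queueing half: anything in the read path may use this.
trait PayloadQueuer {
    fn push_back(&mut self, payload: Vec<u8>);
}

// The flushing half: only the write/process_events() paths should see this, because
// flushing calls back into the SocketDescriptor.
trait SocketDescriptorFlusher {
    fn try_flush_one(&mut self) -> bool;
}

struct OutboundQueue {
    queue: Vec<Vec<u8>>,
}

impl PayloadQueuer for OutboundQueue {
    fn push_back(&mut self, payload: Vec<u8>) {
        self.queue.push(payload);
    }
}

impl SocketDescriptorFlusher for OutboundQueue {
    fn try_flush_one(&mut self) -> bool {
        if self.queue.is_empty() { false } else { self.queue.remove(0); true }
    }
}

// read_event() only receives `impl PayloadQueuer`, so calling try_flush_one() here
// fails to compile even though the caller passes the very same OutboundQueue object.
fn read_event(pending_outbound: &mut impl PayloadQueuer, decrypted_msg: Vec<u8>) {
    pending_outbound.push_back(decrypted_msg);
    // pending_outbound.try_flush_one(); // <- compile error: trait not in the bound
}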
Introduce the MessageQueuer interface used by Transport to queue
messages for send and use it in all the existing places where messages
are queued.
Many of the is_connected() checks were duplicative due to the macro
continuing out of the match statement if the peer referenced by the
node_id had not seen an Init message yet.

Remove the macro in favor of a function on PeerHolder that will return
an Option<> and use it instead.

Split out a new helper function Peer::is_initialized() that will return
true if the peer has seen an Init message.
Encapsulate the Peer information that is only valid after an Init
message has been seen. This makes the accesses to sync_status, ping
information, and their_features cleaner w.r.t. the Peer initialization
state.
The previous implementation would send a peer_disconnect callback for
any error after the NOISE handshake completed, but only send a
peer_connected callback after the Init message was received. This could
lead to a dangling peer_disconnected callback that didn't have a paired
peer_connected.

This patch cleans up the state differentiation to match the following
cases:

unconnected (!transport.is_connected() && peer.post_init_info.is_none())
* Know about the peer, but are still in the process of the NOISE
handshake

connected (transport.is_connected() && peer.post_init_info.is_none())
* NOISE handshake completed, but haven't received an Init message

initialized (transport.is_connected() && peer.post_init_info.is_some())
* The NOISE handshake has completed and the Init message has been
received and processed

With the 3 conceptual states, the read_event() path now only inserts
a Peer into the node_id_to_descriptor map after a Peer enters the
initialized state.

This fixes the asymmetry between peer_disconnected & peer_connected as
well as simplifies the disconnect code.
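A hedged sketch of the three-way classification; the field and method names follow the commit, while the surrounding types are simplified stand-ins.

struct PostInitInfo { /* their_features, sync_status, ping state, ... */ }

struct Transport { connected: bool }
impl Transport {
    fn is_connected(&self) -> bool { self.connected }
}

struct Peer {
    transport: Transport,
    post_init_info: Option<PostInitInfo>,
}

enum PeerState {
    Unconnected, // NOISE handshake still in progress
    Connected,   // NOISE handshake done, Init not yet received
    Initialized, // NOISE handshake done and Init processed
}

impl Peer {
    fn state(&self) -> PeerState {
        match (self.transport.is_connected(), self.post_init_info.is_some()) {
            (false, _) => PeerState::Unconnected,
            (true, false) => PeerState::Connected,
            // Only initialized peers are inserted into node_id_to_descriptor.
            (true, true) => PeerState::Initialized,
        }
    }
}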
Now that the Init handling has been moved and the dependencies are more
clear, deduplicate the parameters of the read_event() and helper
functions to make use of Peer.

The overall pattern remains the same, read_event() does the locking and
passes in the separate items (peer, peers_needing_send,
node_id_to_descriptor) to do_read_event().

The handle_message() path is also cleaned up now that post_init_state is
guaranteed to be valid at that point.
Now that the locking and destructuring is done in read_event(), the
workarounds for the pre-NLL borrow checker can go away.
This patch expands the PeerHolder API allowing for encapsulation of
the peer state from the code that needs to iterate over initialized
peers. This cleans up peer iteration/removal allowing for a much
simpler design.

1) Introduce new APIs for PeerHolder:
* initialized_peers_mut()
* initialized_peer_node_ids()
* remove_peer_by_descriptor()

2) Clean up event handler iteration using new APIs
3) Unify the timer_tick and DisconnectPeer event
   disconnect path using new APIs
4) Convert get_peer_node_ids() to use new API
The broadcast event handling code did not catch Ok(false) returned from
the route handler when deciding whether or not to broadcast messages.
Instead, it only checked if the return value was an Error.

Fix it up and enable the regression tests.
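A hedged sketch of the corrected decision: in this codebase the routing handler returns Ok(true) to forward, Ok(false) for a valid message that should not be re-broadcast, and Err(_) for an invalid one; the error type below is a stand-in.

fn should_forward(handler_result: Result<bool, ()>) -> bool {
    match handler_result {
        Ok(true) => true,
        // The bug: Ok(false) slipped past the old error-only check and was broadcast anyway.
        Ok(false) | Err(()) => false,
    }
}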
Use a separate lock to generate the SecretKey instead of overloading the
PeerHolder mutex.

Refactoring PeerManager removed this hidden constraint and this should
make it more robust in the future.
s/fill_message_queue_with_sync/fill_outbound_queue_with_sync/
s/queue_init_message/enqueue_init_message/
Motivated by lightningdevkit#456, remove the
peers_needing_send set in favor of just scanning the peers during
process_events() and attempting to send data for those peers that have items in
their outbound queue or pending sync items to be sent out.
Small review items that are noncontroversial and can be rolled into a single commit.
Catch a duplicate connection from a peer with the same node_id at the completion
of the NOISE handshake instead of during processing of the INIT message.
* Remove local/global naming when processing INIT message
contents
* Remove duplicate requires_unknown_bits() check
Reduce trait complexity in the OutboundQueue and Transport objects.
Splitting the objects into separate traits allowed for cleaner separation
of responsibilities, but in practice this led to complexity in the
PeerManager function signatures and confusion about why they needed to be separated.

To move this module to a more maintainable state for the core
development team, collapse PayloadQueuer/SocketDescriptorFlusher into a
single IOutboundQueue trait.

Also, remove MessageQueuer in favor of just passing in the entire
Transport object.

This makes the code look more similar to the rest of the codebase while
still leveraging traits in the OutboundQueue and Transport layer that
allow for test doubles and real unit tests.
This gets rid of the complexity required to handle Iterators that
return errors and makes way for fewer copies in the decryption path.
Previously, all input data was written to the read buffer before
any decryption was attempted. This patch will use the input data solely in
the event that there is no previous data in the internal read buffer.

This also converts all tests to use the public interface instead of
the previously used decrypt_single_message

This was design feedback from the original 494 review.
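A hedged sketch of the copy-avoidance in the read path: a simplified decrypt loop, not the PR's actual Decryptor, with decrypt_packet() as a stub standing in for the real length/payload decryption.

struct Decryptor {
    read_buffer: Vec<u8>,
}

impl Decryptor {
    // Stub: returns (plaintext, bytes consumed) once a full encrypted packet is available.
    fn decrypt_packet(&self, _ciphertext: &[u8]) -> Option<(Vec<u8>, usize)> {
        None
    }

    fn read(&mut self, input: &[u8]) -> Vec<Vec<u8>> {
        let mut messages = Vec::new();
        if self.read_buffer.is_empty() {
            // Fast path: decrypt straight out of the caller's slice and only buffer
            // whatever partial packet is left over at the end.
            let mut offset = 0;
            while let Some((msg, used)) = self.decrypt_packet(&input[offset..]) {
                messages.push(msg);
                offset += used;
            }
            self.read_buffer.extend_from_slice(&input[offset..]);
        } else {
            // Slow path: a partial packet is already buffered, so append the new bytes
            // and keep decrypting out of the internal buffer as before.
            self.read_buffer.extend_from_slice(input);
            while let Some((msg, used)) = self.decrypt_packet(&self.read_buffer) {
                messages.push(msg);
                self.read_buffer.drain(..used);
            }
        }
        messages
    }
}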
This makes the process_act() interface a bit cleaner on a successfully
completed handshake.
The public interface was in a half-state with some callers referencing
the Decryptor directly for message iteration and others using the
public interface for the encryption path. This patch moves everything to
using the Encryptor and Decryptor directly.

This is motivated by feedback from 494, which recommended the objects
be split up while still sharing common functions for key rotation.
Now that the CompletedHandshakeInfo exists to pass the relevant pieces
out of the handshake code and all users go to the Encryptor and
Decryptor directly, this is no longer needed.
Clean up the loose comments, test names, and variable names that still
referred to Conduit now that it has been destructured.
introduced: 6185a28

The decryption path uses a read buffer to concatenate partial encrypted
message payloads. The read buffer size can grow larger than
LN_MAX_PACKET_LENGTH momentarily as the new bytes are added to the read
buffer, but before decryption starts.

Fix the invalid panic!() and add a test.
const KEY_ROTATION_INDEX: u32 = 1000;

/// Instantiate a new (Encryptor, Decryptor) with specified sending and receiving keys
pub fn create_encryptor_decryptor(sending_key: SymmetricKey, receiving_key: SymmetricKey, chaining_key: SymmetricKey) -> (Encryptor, Decryptor) {
Collaborator

Why not just store the keys in CompletedHandshakeInfo in handshake instead of calling into encryption? This would decouple the two modules and remove the need to expose Encryptor/Decryptor.

Author

@julianknutsen julianknutsen Oct 5, 2020

This was just to preserve the original design choice of the handshake code instantiating the Conduit (now Encryptor/Decryptor).

The more this review goes on, the more I realize that my assumption that #494 had been reviewed and approved, at least from a design perspective, was naive. I spent a lot of time trying to preserve the original design choices but likely should have just started over from the beginning.

I think that pattern makes a lot of sense. The handshake code will just export the keys and the Transport layer will instantiate an Encryptor/Decryptor with them.

I'll put that patch together in the next day or so.

Author

This seemed a bit too obvious when I first read the feedback, but I couldn't put my finger on why. I went back through my stack to understand what happened, and I wanted to share what I found for the next person who may fix this up.

The original design choice in 494 was to have the handshake code consume all bytes from the user unconditionally. In the event that the handshake was completed, but there was additional data, the bytes were transferred into the Decryptor (inside the handshake code) for later retrieval by the PeerManager. At that point in time, the PeerManager worked with the handshake code directly.

In order to move the Decryptor creation outside the handshake code, the likely design choice would be to have the handshake code return the number of bytes consumed and have the higher layers drain the input buffer. Then, when the handshake completes, it can instantiate the Decryptor with whatever bytes are left.

As the refactoring took place, it made sense to keep this design choice to limit the number of external changes to the code in order to reduce the feedback surface.

I had one outstanding patch that attempted to do this in the pre-Transport layering. Unfortunately, the code inside PeerManager was already very complicated and adding an additional "bytes read" stage to it without tests was too much complexity at the time. So, the design choice stayed.

However, now that there is more distinct layering this is definitely worth looking into again. The PeerManager code would still unconditionally hand off all bytes to the Transport layer and the Transport layer would handle the "bytes read" logic as well as the construction of the Decryptor.

Likely, this will be a much cleaner follow-on PR. This also makes sense since it doesn't really affect the PeerManager interface and is more of an implementation detail of the Transport<->Handshake code relationship.
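A hedged sketch of that alternative layering, with hypothetical names and signatures throughout; nothing here is the PR's actual API.

struct Decryptor { read_buffer: Vec<u8> }
struct Handshake;

enum HandshakeResult {
    InProgress { response: Vec<u8>, bytes_consumed: usize },
    Complete { response: Vec<u8>, bytes_consumed: usize },
}

impl Handshake {
    fn process(&mut self, input: &[u8]) -> HandshakeResult {
        // NOISE act processing elided; report how much of the input was consumed.
        HandshakeResult::Complete { response: Vec::new(), bytes_consumed: input.len() }
    }
}

struct TransportLayer { buffer: Vec<u8> }

impl TransportLayer {
    // PeerManager still hands over all bytes unconditionally; the Transport layer owns
    // the "bytes consumed" bookkeeping and builds the Decryptor from the leftovers.
    fn on_read(&mut self, handshake: &mut Handshake, input: &[u8]) -> Option<Decryptor> {
        self.buffer.extend_from_slice(input);
        match handshake.process(&self.buffer) {
            HandshakeResult::InProgress { bytes_consumed, .. } => {
                self.buffer.drain(..bytes_consumed);
                None
            }
            HandshakeResult::Complete { bytes_consumed, .. } => {
                self.buffer.drain(..bytes_consumed);
                // Whatever is left is the start of the first encrypted message.
                Some(Decryptor { read_buffer: self.buffer.split_off(0) })
            }
        }
    }
}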

@TheBlueMatt
Collaborator

The author decided to move on some time ago, and redoing the great work here would likely require a substantial effort to rebase and address the regressions that were being worked on in the later PRs in this set when the author moved on. Going to close.
