Skip to content

(4/4) Update OutboundQueue flush behavior and fix existing bugs that it found #714

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 82 commits into from

Conversation

julianknutsen
Copy link

@julianknutsen julianknutsen commented Sep 18, 2020

History

  • Update-10-5-2020 Code handoff to @Arik and @ariard for next steps

New commits start here: fb3dcf0

Make the OutboundQueue flush operation a bit cleaner and look more like a batch operation.

This is a behavior change, but it should lead to performance improvements in the case the sync callbacks to the RoutingMessageHandler were slow. Previously, when the queue was full a single sync operation was enqueued at a time.

This PR also addresses two bugs that were found in the testing of this API update.

test fix: Fix TestRoutingMessageHandler get_next_channel_announcements()
fix: Don't exceed queue soft limit with sync messages

@codecov
Copy link

codecov bot commented Sep 18, 2020

Codecov Report

Merging #714 into main will increase coverage by 1.92%.
The diff coverage is 91.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #714      +/-   ##
==========================================
+ Coverage   92.01%   93.94%   +1.92%     
==========================================
  Files          37       45       +8     
  Lines       20106    23041    +2935     
==========================================
+ Hits        18500    21645    +3145     
+ Misses       1606     1396     -210     
Impacted Files Coverage Δ
lightning/src/ln/peers/handler.rs 97.43% <ø> (ø)
lightning/src/ln/wire.rs 89.71% <ø> (+21.71%) ⬆️
lightning/src/ln/features.rs 88.99% <50.00%> (-0.38%) ⬇️
lightning/src/util/test_utils.rs 75.44% <62.42%> (-9.54%) ⬇️
lightning/src/ln/peers/test_util.rs 89.10% <89.10%> (ø)
lightning/src/ln/peers/transport.rs 92.30% <92.30%> (ø)
lightning/src/ln/peers/handshake/acts.rs 94.64% <94.64%> (ø)
lightning/src/ln/peers/encryption.rs 95.12% <95.12%> (ø)
lightning/src/ln/peers/chacha.rs 95.65% <95.65%> (ø)
lightning/src/ln/peers/handshake/mod.rs 95.91% <95.91%> (ø)
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e48f8e3...4ea45dd. Read the comment docs.

@julianknutsen
Copy link
Author

Rebased against upstream

julianknutsen and others added 22 commits October 2, 2020 12:58
This will make future changes that stabilize this work easier to review
and merge.

Co-authored-by: Arik Sosman <[email protected]>
Co-authored-by: Julian Knutsen <[email protected]>
Lots of build warnings still, but it builds and tests pass.
Add unit tests against the PeerHandshake public interface prior to
changing any of the implementation details.

This is a solid stand-alone patch, even if the subsequent refactor and
design changes are not merged. A future patch uses the test vectors from
the RFC to minimize the code under test.
The implementation logic is duplicated from handshake/mod.rs so that it
can be tested and refactored independent of the current implementation.

This uses enum dispatch coupled with separately 'moved' state objects
that allow for simple testing, easy state inspection, and understandable
transition logic between states. This pattern also removes the need for
 match statements inside implementation logic in favor of the instance
representing the current state and available data.

Every state transition is implemented in a next() function and is a
straight forward translation from an object representing the existing
known data and an input act message (bytes).

Example usage:

let (act_data, next_state) = cur_state.next(input_data);

This state machine will eventually be moved into place inside the
PeerHandshake which will act as a simple wrapper that just maintains the
state machine and marshals the data between the peer_handler and states.
PeerHandshake is now a simple object around the new state machine that
simply marshals data between the callers and state machine states. For
now, keep the APIs identical to make the patches easier to read.

The only interesting piece to note here is the handling of
get_remote_pubkey() to keep parity with the existing implementation.
Future patches will remove this separate function in favor of callers
taking ownership of it once the handshake is complete.
Introduce the sha256!() and concat!() macros that are used to
improve readability of the noise exchanges.
Small inconvenience with the HandshakeHash return value that will be
taken care of by the end of this patch stack.
Fix inconsistent use of concat vs. Sha256::engine() input. These can
be deduplicated at the end.
Identify potential issue regarding the invalid case where
the read_buffer has data after completion.
All needed public keys are calculated during the first
state and passed through as needed. Limits the call sites
of private_key_to_public and makes the initial states
for the initiator and responder more symmetric.
Create an alias type to make the code more readable and change all
usages to moves.
Push buffer read into helper function
Combines static length values of ActOne and ActTwo structs
All consumers immediately serialize the struct so just
do it for them and hide the implementation details of the
handshake.
Now that final process_act() returns (conduit, remote_pubkey), the
callers can just receive it when the handshake is complete.
Now that the remote_pubkey is returned from process_act(), the logic
can be more straightforward. Each branch is responsible
for returning the next PeerState and PeerDataProcessingDecision.

Remove Option from CompleteHandshake now that the PublicKey is
guaranteed to exist
Clean up loose threads (wrong docs, variable names) now that this code
is in a more stable state.
Now that the implementation is complete and swapped over, remove
the duplicated tests for process_act(). At this layer, all that
needs to be tested is:

* The new state in PeerHandshake is set to the state returned from
state_machine.next()
* Any error generated from state_machine.next() is returned
* Any calls to process_next() after an error was returned panic

These tests are a good use case for a mocking library, but for now
just leverage implementation details to trigger the proper errors
and states.
Introduce the MessageQueuer interface used by Transport to queue
messages for send and use it in all the existing places where messages
are queued.
Many of the is_connected() checks were duplicative due to the macro
continuing out of the match statement if the peer referenced by the
node_id had not seen an Init message yet.

Remove the macro in favor of a function on PeerHolder that will return
an Option<> and use it instead.

Split out a new helper function Peer::is_initialized() that will return
true if the peer has seen an Init message.
Encapsulate the Peer information that is only valid after an Init
message has been seen. This makes the accesses to sync_status, ping
information, and their_features cleaner w.r.t. the Peer initialization
state.
The previous implementation would send a peer_disconnect callback for
any error after the NOISE handshake completed, but only send a
peer_connected callback after the Init message was received. This could
lead to a dangling peer_disconnected callback that didn't have a paired
peer_connected.

This patch cleans up the state differentiation to match the following
cases:

unconnected (!transport.is_connected() && peer.post_init_info.is_none())
* Know about the peer, but are still in the process of the NOISE
handshake

connected (transport.is_connected() && peer.post_init_info.is_none())
* NOISE handshake completed, but haven't received an Init message

initialized (transport.is_connected() && peer.post_init_info.is_some())
* The NOISE handshake has completed and the Init message has been
received and processed

With the 3 conceptual states, the read_event() path now only inserts
a Peer into the node_id_to_descriptor map after a Peer enters the
initialized state.

This fixes the asymmetry between peer_disconnected & peer_connected as
well as simplifies the disconnect code.
Now that the Init handling has been moved and the dependencies are more
clear, deduplicate the parameters of the read_event() and helper
functions to make use of Peer.

The overall pattern remains the same, read_event() does the locking and
passes in the separate items (peer, peers_needing_send,
node_id_to_descriptor) to do_read_event().

The handle_message() path is also cleaned up now that post_init_state is
guaranteed to be valid at that point.
Now that the locking and destructuring is done in read_event(), the
workarounds for the pre-NLL borrow checker can go away.
This patch expands the PeerHolder API allowing for encapsulation of
the peer state from the code that needs to iterate over initialized
peers. This cleans up peer iteration/removal allowing for a much
simpler design.

1) Introduce new APIs for PeerHolder:
* initialized_peers_mut()
* initialized_peer_node_ids()
* remove_peer_by_descriptor()

2) Clean up event handler iteration using new APIs
3) Unify the timer_tick and DisconnectPeer event
   disconnect path using new APIs
4) Convert get_peer_node_ids() to use new API
The broadcast event handling code did not catch Ok(false) returned from
the route handler when deciding whether or not to broadcast messages.
Instead, it only checked if the return value was an Error.

Fix it up and enable the regression tests.
Use a separate lock to generate the SecretKey instead of overloading the
PeerHolder mutex.

Refactoring PeerManager removed this hidden constraint and this should
make it more robust in the future.
s/fill_message_queue_with_sync/fill_outbound_queue_with_sync/
s/queue_init_message/enqueue_init_message/
Motivated by lightningdevkit#456, remove the
peers_needing_send set in favor of just scanning the peers during
process_events() and attempting to send data for those peers that have items in
their outbound queue or pending sync items to be sent out.
Small review items that are noncontroversial and can be rolled into a single commit.
Catch a duplication connection from a peer with the same node_id at the completion
of the NOISE handshake instead of the processing of the INIT message.
* Remove local/global naming when processing INIT message
contents
* Remove duplicate requires_unknown_bits() check
Reduce trait complexity in the OutboundQueue and Transport objects.
Splitting the objects into separate traits allowed for cleaner separation
of responsibilities. But in practice, this led to complexity in the
PeerManager function signatures and confusion in why they needed to be separated.

To move this module to a more maintainable state for the core
development team, collapse PayloadQueuer/SocketDescriptorFlusher into a
single IOutboundQueue trait.

Also, remove MessageQueuer in favor of just passing in the entire
Transport object.

This makes the code look more similar to the rest of the codebase while
still leveraging traits in the OutboundQueue and Transport layer that
allow for test doubles and real unit tests.
This gets rid of the complexity required to handle Iterators that
return errors and makes way for fewer copies in the decryption path.
Previously, all input data was written to the read buffer before
any decryption was attempted. This patch will use the input data solely in
the event that there is no previous data in the internal read buffer.

This also converts all tests to use the public interface instead of
the previously used decrypt_single_message

This was design feedback from the original 494 review.
This makes the process_act() interface a bit cleaner on a successfully
completed handshake.
The public interface was in a half-state with some callers referencing
the Decryptor directly for message iteration and others using the
public interface for the encryption path. This patch moves everything to
using the Encryptor and Decryptor directly.

This is motivated by feedback from 494, that recommended the objects
are split up but still have common functions for the key rotation.
Now that the CompletedHandshakeInfo exists to pass the relevant pieces
out of the handshake code and all users go to the Encryptor and
Decryptor directly, this is no longer needed.
Clean up the loose comments, test names, and variables names that still
referred to conduit now that it has been destructured.
introduced: 6185a28

The decryption path uses a read buffer to concatenate partial encrypted
message payloads. The read buffer size can grow larger than
LN_MAX_PACKET_LENGTH momentarily as the new bytes are added to the read
buffer, but before decryption starts.

Fix the invalid panic() and add a test
introduced: 3e0aed2

The existing implementation of the TestRoutingMessageHandler test stub
had two bugs:

* Conditions where the range end point would be less than the range
  start point resulting in pushed values when they weren't expected
* Test stub did not return TOTAL_UPDS total updates. Instead it
  returned half of them due to the calculation of the range.
The do_attempt_write_data() function would previously only flush
one item at a time. This would cause non-optimal batching in cases
where the queue was full and the RoutingMessageHandler was only asked
for a single item.

Instead, change the API to flush the entire contents of the queue.
introduced: 8d7b498

The current implementation will add 12 items to a queue with a
soft_limit of 10. Fix it to match the expected behavior which is that
sync messages do not fill the queue above the soft limit.
@TheBlueMatt
Copy link
Collaborator

The author decided to move on some time ago, and redoing the great work here would likely require a substantial effort to rebase and address the regressions that were being worked on in the later PRs in this set when the author moved on. Going to close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants