-
Notifications
You must be signed in to change notification settings - Fork 407
(2/4) PeerManager Unit Tests and Refactoring #692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(2/4) PeerManager Unit Tests and Refactoring #692
Conversation
2d0e53e
to
16bdd61
Compare
Codecov Report
@@ Coverage Diff @@
## main #692 +/- ##
==========================================
+ Coverage 91.91% 93.54% +1.63%
==========================================
Files 36 44 +8
Lines 20161 22238 +2077
==========================================
+ Hits 18531 20803 +2272
+ Misses 1630 1435 -195
Continue to review full report at Codecov.
|
a18f7a2
to
461f3be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed until f008fd2, including the bug fix.
Mostly call for better documentation. I don't think there is performance regression concern due the new interface.
When it's all over a gentle documentation in src/ln/peers/mod.rs
would be great to explain relations between interfaces and sum up design goals.
lightning/src/ln/peers/handler.rs
Outdated
@@ -40,6 +40,9 @@ use bitcoin::hashes::sha256::HashEngine as Sha256Engine; | |||
use bitcoin::hashes::{HashEngine, Hash}; | |||
use ln::peers::handshake::PeerHandshake; | |||
use ln::peers::conduit::Conduit; | |||
use ln::peers::outbound_queue::OutboundQueue; | |||
|
|||
const MSG_BUFF_SIZE: usize = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment about how this const is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Small nits, documentation, and fixups
lightning/src/ln/peers/handler.rs
Outdated
@@ -120,6 +123,64 @@ enum InitSyncTracker{ | |||
NodesSyncing(PublicKey), | |||
} | |||
|
|||
/// Trait representing a container that allows enqueuing of Vec<[u8]> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beyond what this is trait is representing, maybe you can explain how it should be used, namely enqueuing pending messages to send to a peer. Precise this interface is unidirectional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Small nits, documentation, and fixups
lightning/src/ln/peers/handler.rs
Outdated
} | ||
|
||
/// Trait representing a container that can try to flush data through a SocketDescriptor | ||
pub(super) trait SocketDescriptorFlusher { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here precise how the caller should use this interface. Maybe you can add a comment explaining design goals to have splitted interfaces for an outbound queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Small nits, documentation, and fixups
lightning/src/ln/peers/handler.rs
Outdated
|
||
while !peer.pending_outbound_buffer.is_blocked() { | ||
let queue_space = peer.pending_outbound_buffer.queue_space(); | ||
if queue_space > 0 { | ||
match peer.sync_status { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a side-note, but I think in the future we would like to move the sync status inside NetGraphMsgHandler
. IMO this logic doesn't belong to the peer handler as it's already concerned with processing state.
It would be easier to implement more-agressive syncing logic like fetching valid node_announcement
and start thinking with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agree. This is a gross design choice and I did the minimal abstraction to test the OutboundQueue while not introducing behavior changes. It is very awkward to use right now and points to a code smell.
There is a story in the backlog (currently moving into GitHub issues) where I've even called out this particular item as extremely hard to unit tests in the current design.
lightning/src/ln/peers/handler.rs
Outdated
/// Write previously enqueued data to the SocketDescriptor. A return of false indicates the | ||
/// underlying SocketDescriptor could not fulfill the send_data() call and the blocked state | ||
/// has been set. Use unblock() when the SocketDescriptor may have more room. | ||
fn try_flush_one(&mut self, descriptor: &mut impl SocketDescriptor) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name is a bit confusing, try flush one what ? It could be a byte, the whole buffer, a unit of the underlying container data structure... Maybe try_flush_all
it's more explicit ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another artifact of the current code doing something strange. I'll rename it to be try_flush_one_payload
if you think that makes it more obvious. It flushes one item that was previously enqueued which is how the name came about. Super flexible.
This should obviously move to a flush_all() behavior. It makes the sync message generation better too. But, outside of this PR as this is just refactoring and documenting current behavior so the right path can be identified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put together a patch to change the API to flush()
and it flushes the entire queue. It is in a separate PR found here: #714
Unfortunately, this had to go as a separate PR at the end as it encountered an existing bug (5c4cda7) fixed in 3/3. Luckily, it also found 2 more exiting bugs so the quality bar is going up. Let me know what you think.
lightning/src/ln/peers/handler.rs
Outdated
@@ -123,126 +187,8 @@ enum InitSyncTracker{ | |||
NodesSyncing(PublicKey), | |||
} | |||
|
|||
/// Trait representing a container that allows enqueuing of Vec<[u8]> | |||
pub(super) trait PayloadQueuer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this code move in a move-only commit to ease review ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Move PayloadQueuer trait move to a separate commit
fn process_act(&mut self, input: &[u8]) -> Result<(Option<Vec<u8>>, Option<(Conduit, PublicKey)>), String>; | ||
} | ||
|
||
pub(super) struct Transport<PeerHandshakeImpl: IPeerHandshake=PeerHandshake> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment, wrapper around Conduit
and PeerHandshakeImpl
, is used by Peer to provide access to an authenticated and encrypted transport.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment here. I'd like to stay away from comments that talk about what items are contained inside and more about the features they provide. Let me know if my addition seems reasonable to you.
Done: review: Small nits, documentation, and fixups
@@ -30,13 +31,6 @@ pub(super) enum HandshakeState { | |||
Complete(Option<(Conduit, PublicKey)>), | |||
} | |||
|
|||
// Trait for all individual states to implement that ensure HandshakeState::next() can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It it's easy to have this move in its own commit, you're welcome to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Move IHandshakeState code move to separate patch
@@ -60,6 +60,9 @@ pub(super) trait ITransport { | |||
/// Returns true if the connection is established and encrypted messages can be sent. | |||
fn is_connected(&self) -> bool; | |||
|
|||
/// Returns all Messages that have been received and can be parsed by the Transport | |||
fn drain_messages<L: Deref>(&mut self, logger: L) -> Result<Vec<Message>, PeerHandleError> where L::Target: Logger; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"can be parsed" shouldn't this be "successfully parsed", as those messages are extracted by the caller from the Transport object ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Small nits, documentation, and fixups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this return a Vec
and not an Iterator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like Decryptor
already implements Iterator, so I presume this would be doable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think this got lost in the noise.
This was the design chosen in #494 and I chose to limit the number of behavior changes in the initial refactoring to ease the review (although I'm not sure how much that really saved in the end). I've done my best to separate out behavior changes to a separate patch so they can be easily dropped if necessary.
This could just as easily be a drain iterator or an iterator followed by clear() or drain(). I don't really like the pattern of iterators that also remove, so this was the cleanest analog. I don't think there is any additional memory pressure here since the bytes exist in the Decryptor anyway and they are effectively transferred to this layer.
But, it seems reasonable to change this up in the future if there is a valid reason. An iterator/drain pattern would mesh well with my thoughts in #698.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I reading it wrong in transport.rs
? It looks like it processes bytes, creating messages and copying them onto the heap (in the vec), possible with reallocation (ie copies) to grow the vec. An iterator/drain iterator would just deserialize directly onto the stack and return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is probably worth looking at the code after 2b8d24d and recommending your changes on top of it. At that point, the code smells around iterators that can return errors and the feature to process the undecrypted message directly from the input slice has been implemented.
The original code from 494 had a bug in that it just asserted on all errors. The initial patches fixed this up to follow a more standard iterator-that-can-error pattern, but as you will see looking at it, it is definitely a code smell and hard to maintain.
Future patches culminating in 2b8d24d made this much easier to test and reason about. If you aren't happy with the performance characteristics that is probably a good place to talk about the tradeoffs so that the right design/fix can go in.
lightning/src/ln/peers/handler.rs
Outdated
@@ -419,7 +419,6 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref> PeerManager<D | |||
($msg: expr) => { | |||
{ | |||
log_trace!(self.logger, "Encoding and sending sync update message of type {} to {}", $msg.type_id(), log_pubkey!(peer.their_node_id.unwrap())); | |||
assert!(peer.transport.is_connected()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A code suggestion, delete here and instead in enqueue_message()
.
assert!(self.conduit.is_some())
if let Some(conduit) = self.conduit {
// push back encrypted buffer on output queue
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that is why I removed it. At this point, Transport::enqueue_message()
panics already.
fn enqueue_message<M: Encode + Writeable, Q: PayloadQueuer>(&mut self, message: &M, output_buffer: &mut Q) {
match self.conduit {
None => panic!("Enqueueing messages only supported after transport is connected"),
Some(ref mut conduit) => {
let mut buffer = VecWriter(Vec::new());
wire::write(message, &mut buffer).unwrap();
output_buffer.push_back(conduit.encrypt(&buffer.0));
}
}
}
461f3be
to
72a5400
Compare
Thanks for the progress, @ariard. I've addressed all comments and pushed a version with your requested changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed up until 461f3be, you can partially disregard some comments as review was in-order per-commit.
Thanks for the lot of bug fixes. I don't think there is performance regression. What I want to think for next round if this design may block us for future perf improvements (e.g a pool-thread for encryption, but even this not sure as that's symmetric crypto post-handshake which should be cheap).
lightning/src/ln/peers/handler.rs
Outdated
|
||
/// Interface PeerManager uses to queue message to send. Used primarily to restrict the interface in | ||
/// specific contexts. e.g. Only queueing during read_event(). No flushing allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this commit 0f7367b, it's not used in read_event()
which is still using transport ? If it's the upper stack of patch disregard this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Small nits, documentation, and fixups
lightning/src/ln/peers/handler.rs
Outdated
@@ -181,6 +181,13 @@ struct Peer<TransportImpl: ITransport> { | |||
} | |||
|
|||
impl<TransportImpl: ITransport> Peer<TransportImpl> { | |||
|
|||
/// Returns true if an INIT message has been received from this peer. Implies that this node | |||
/// can send and receive encrypted messages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add (is_connected()==true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Small nits, documentation, and fixups
lightning/src/ln/peers/handler.rs
Outdated
@@ -214,6 +221,30 @@ struct PeerHolder<Descriptor: SocketDescriptor, TransportImpl: ITransport> { | |||
node_id_to_descriptor: HashMap<PublicKey, Descriptor>, | |||
} | |||
|
|||
impl<Descriptor: SocketDescriptor, TransportImpl: ITransport> PeerHolder<Descriptor, TransportImpl> { | |||
fn initialized_peer_by_node_id(&mut self, node_id: &PublicKey) -> Option<(Descriptor, &mut Peer<TransportImpl>)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if it's not exposed, you can add a comment. Generally I'm leaning towards commenting semantic of even small helpers as it's proven that confused naming bring bugs (cf. #690 recently)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. This seems obvious to me since I've been in the code, but may not be to a new reader.
Done: review: Small nits, documentation, and fixups
lightning/src/ln/peers/handler.rs
Outdated
} | ||
self.do_attempt_write_data(&mut descriptor, &mut peer.sync_status, &mut peer.transport, &mut peer.pending_outbound_buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously this was called even if is_connected()==false
so it may have try for nothing to flush state and the socket descriptor send_data() would have return 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The get_peer_for_forwarding()
macro actually calls continue
if the node isn't connected so the line is never run. One of the code smells I cleaned up with refactor: Clean up event handling path w.r.t connected state
Definitely not a fan of macro usages like this
lightning/src/ln/peers/handler.rs
Outdated
node_id_to_descriptor.remove(&their_node_id); | ||
self.message_handler.chan_handler.peer_disconnected(&their_node_id, false); | ||
|
||
return false; // retain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this actually drop the peer so should not retain
.
Maybe instead of retain
/needs_to_write_data
you can write inline comment about the high-level action which is taken, like conserving this peer as timely fulfilling ping exchange
/ dropping this peer as failing ping exchange
/ need to send a ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, just a case where it had to look a bit worse before it looks better. Cleaned up in refactor: Clean up the PeerHolder access/iteration/remove
which has some comments. Let me know if that looks OK to you.
lightning/src/ln/peers/transport.rs
Outdated
@@ -82,14 +82,16 @@ impl<PeerHandshakeImpl: IPeerHandshake> ITransport for Transport<PeerHandshakeIm | |||
if let Some((conduit, remote_pubkey)) = conduit_and_remote_pubkey_option { | |||
self.conduit = Some(conduit); | |||
self.their_node_id = Some(remote_pubkey); | |||
Ok(true) // newly connected | |||
} else { | |||
Ok(false) // newly connected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these inline comments are misleading, it's actually not connected ? Maybe add a method comment to help understand semantic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe newly
is the wrong word? This function returns true
on the first call that ends with a connected state. I've updated the trait API and put an inline comment in review: Small nits, documentation, and fixups
to help future readers.
lightning/src/ln/peers/handler.rs
Outdated
} | ||
|
||
// Process an incoming Init message and set Peer and PeerManager state accordingly | ||
fn process_init_message(&self, message: Message, descriptor: &Descriptor, peers_needing_send: &mut HashSet<Descriptor>, node_id_to_descriptor: &mut HashMap<PublicKey, Descriptor>, pending_outbound_buffer: &mut impl PayloadQueuer, outbound: bool, transport: &mut impl ITransport, post_init_state: &mut Option<PostInitState>) -> Result<(), PeerHandleError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you precise other decisions taken ? Sending init
if inbound connection, starting sync if signaled, disconnect peer if non-Init first message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Small nits, documentation, and fixups
lightning/src/ln/peers/handler.rs
Outdated
Message::Init(ref init_message) => { | ||
log_trace!(self.logger, "Received Init message from {}", log_pubkey!(&their_node_id)); | ||
if node_id_to_descriptor.contains_key(&their_node_id) { | ||
log_trace!(self.logger, "Got second connection with {}, closing", log_pubkey!(&their_node_id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should catch duplicate connection earlier that init message processing, as soon we can do it in read_event()
? We have both node_id_to_descriptor
and peer_descriptor
there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, that seems reasonable. I added a review patch review: Catch duplicate connection earlier
to catch it at the completion of the NOISE handshake instead of the INIT message processing.
lightning/src/ln/peers/handler.rs
Outdated
log_info!(self.logger, "Peer global features required unknown version bits"); | ||
return Err(PeerHandleError { no_connection_possible: true }.into()); | ||
} | ||
if init_message.features.requires_unknown_bits() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is duplicated, global/local are legacy notions, see BOLT 9 and #428. Not yours, but you can a commit to drop it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Remove legacy (local/global) feature flag notation
lightning/src/ln/peers/handler.rs
Outdated
} | ||
|
||
log_info!( | ||
self.logger, "Received peer Init message: data_loss_protect: {}, initial_routing_sync: {}, upfront_shutdown_script: {}, static_remote_key: {}, unknown flags (local and global): {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, replace local/global
per flat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Remove legacy (local/global) feature flag notation
72a5400
to
f02b6a8
Compare
Thanks for the full first pass @ariard. I've addressed all comments and pushed another version with your requests. I don't think the encryption thread pool will be hard to integrate, but I'm interested to hear your thoughts. The Encryptor/Decryptor will just need to be updated be passed the pool and use it for the operations. Should make it simple to test as well since those are now separate objects. Definitely something to do once it is a problem and not overengineer too much upfront. |
The module had more bugs than I expected, but that validated all the work up front to make it testable. I think it is a great case study for how tests like this can be leveraged to make the code a lot more stable. You should see quick feature iteration on this layer too now that the tests exist. |
20befe2
to
5c346e0
Compare
Rebased against upstream and added licenses to new files. |
lightning/src/ln/peers/handler.rs
Outdated
use bitcoin::hashes::{HashEngine, Hash}; | ||
use ln::peers::outbound_queue::OutboundQueue; | ||
use ln::peers::transport::{PayloadQueuer, Transport}; | ||
use bitcoin::hashes::core::iter::Filter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be std::iter::Filter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: squashed into refactor: Clean up the PeerHolder access/iteration/remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it pretty hard to track what's going on with all the new traits, can we try to squash them a bit, and/or make the naming more consistant across traits vs structs, and/or add more documentation describing what they do?
CM::Target: ChannelMessageHandler, | ||
RM::Target: RoutingMessageHandler, | ||
L::Target: Logger { | ||
inner: PeerManagerImpl<Descriptor, CM, RM, L>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need an inner
vs just putting the contents in PeerManager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using default type parameters in the public interface requires that the Transport
type has public visibility as well.
This would mean leaking private types as well as making it unclear from a user perspective on whether or not they needed to implement a Transport object (it would show up in the rustdoc as well). Since this is just a pattern for testability, I chose to hide it as an inner module which allowed the use of dependency injection without an interface change to the user.
lightning/src/ln/peers/handler.rs
Outdated
|
||
/// Trait representing a container that can try to flush data through a SocketDescriptor. Used by the | ||
/// PeerManager to handle flushing the outbound queue and flow control. | ||
pub(super) trait SocketDescriptorFlusher { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this separate from MessageQueuer, it looks like everything implements both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separating what an object can do and what another object needs it to provide allows for a more modular design and easier testing. The fact that OutboundQueue
implements both interfaces is more of a convenience than a reason to flatten the interfaces.
By splitting the entire API into the interfaces that each object needs, you can easily write tests by just implementing test doubles for the subset of the API and not have to mock/spy a bunch of unused features/functions. It also allows the function signatures to document the operations that the function will use which can come in handy when trying to restrict operations in a different phase (such as no flush in read_event()
)
Testing
The transport tests for process_input()
only need access to the subset of the API that enqueues messages for sending. This allows the tests to use a simple Vec spy instead of a full OutboundQueue and get the same test coverage.
Interface vs Object Development
Using interfaces instead of objects also focuses the attention on which features a consumer object needs to get their work done. In my experience, this leads to smaller interfaces between objects that don't overengineer anything.
Another approach I experimented with was to have a separate SocketDescriptorFlusherImpl
object that uses composition w/ a PayloadQueuer
and just handles the logic to add Sync messages and flow control. This separate object would implement the SocketDescriptorFlusher
trait and fits more in line with the single responsibility principle.
Given the longer feedback loops and my experience with OO principles being less ingrained in open source projects, I chose to limit the number of new objects and first see if this was an approach that was interesting to the project.
@@ -60,6 +60,9 @@ pub(super) trait ITransport { | |||
/// Returns true if the connection is established and encrypted messages can be sent. | |||
fn is_connected(&self) -> bool; | |||
|
|||
/// Returns all Messages that have been received and can be parsed by the Transport | |||
fn drain_messages<L: Deref>(&mut self, logger: L) -> Result<Vec<Message>, PeerHandleError> where L::Target: Logger; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this return a Vec
and not an Iterator
?
lightning/src/ln/peers/handler.rs
Outdated
const OUTBOUND_QUEUE_SIZE: usize = 10; | ||
|
||
/// Interface PeerManager uses to interact with the Transport object | ||
pub(super) trait ITransport: MessageQueuer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it very hard with this design to see trivially where the queues are hiding, and how big they can grow, and what each trait means. Can we get some documentation on what the difference between a Message and PayloadQueuer is, how that fits into the OutboundQueue, why process_input
takes a trait which OutboundQueue
implements, etc. I also think we can squash a bunch of stuff to make this much more readable - can we avoid having both a PayloadQueuer and a MessageQueuer and only have one queue and one thing that encrypts/decrypts against the queue, or is that hiding here somewhere and I dont see it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vec vs. Iterator
This was the design chosen in #494 and I chose to limit the number of behavior changes in the initial refactoring to ease the review (although I'm not sure how much that really saved in the end). I've done my best to separate out behavior changes to a separate patch so they can be easily dropped if necessary.
This could just as easily be a drain iterator or an iterator followed by clear()
or drain()
. I don't really like the pattern of iterators that also remove, so this was the cleanest analog. I don't think there is any additional memory pressure here since the bytes exist in the Decryptor anyway and they are effectively transferred to this layer.
But, it seems reasonable to change this up in the future if there is a valid reason. An iterator approach would mesh well with my thoughts in #698.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cleaning up the naming might make this a bit easier. I'm not attached to any of the trait names but tried to follow Rust trait idioms, but I can see how they can all blend if you haven't been staring at the code for 6 weeks straight. I've answered your questions inline. If you have better names that make the traits a lot more obvious I am happy to clean them up.
I find it very hard with this design to see trivially where the queues are hiding,
The only queue is the OutboundQueue
and is owned by the PeerManager
at the top level. Same as the LL before.
and how big they can grow,
The PeerManager
shouldn't care how big they can grow and this was one of the benefits of splitting out the OutboundQueue
object. I've kept the const at the top level, but that was only to keep a single constructor for tests and the actual implementation.
If there are performance issues in the future, the features can just be added to the OutboundQueue
and tested in isolation.
As far as max size, the read_event()
path pushes unconditionally. There was no behavior change for this in my patch stack and the API and docs now reflect the behavior. Is this the best long-term, probably not. That feature is captured in #703.
and what each trait means. Can we get some documentation on what the difference between a Message and PayloadQueuer is,
A naming fixup might help the confusion. I will do my best to describe the relationships here and your feedback can help me clean up the trait documentation to help the next reader.
The PeerHandler
owns the OutboundQueue
which is a simple wrapper object for raw bytes that need to be sent out of the socket descriptor. The OutboundQueue
implements two traits:
PayloadQueuer
handles the raw byte enqueue into the queueSocketDescriptorFlusher
handles the flushing and flow control
The PayloadQueuer
is the interface consumed by the Transport
layer. It needs a way to push unencrypted bytes pre-NOISE and encrypted bytes post-NOISE.
how that fits into the OutboundQueue, why
process_input
takes a trait whichOutboundQueue
implements, etc.
process_input()
takes a PayloadQueuer
because it is just responsible for writing raw bytes into the OutboundQueue
. By taking the trait, instead of the OutboundQueue
, it is more clear (and guaranteed by the type system) that the Transport
layer is only responsible for pushing items and not flushing them.
I also think we can squash a bunch of stuff to make this much more readable
Sort of opposite feedback from @ariard who asked me to split it out more. Is this something that can just be fixed with git diff
? I did my best to not only break up the commits into smaller consumable pieces but also walk the reviewers through my train of thought as I did the refactor. Big refactors like this don't always have an obvious endpoint when you start. You have to untangle it piece by piece and see what it looks like. When I'm reviewing large changes like this I find it adds a lot more confidence if I can see how someone approached a redesign vs. just the final product.
In the end, if you can't review it this will just sit here forever, so please let me know how I can help make it easier on you.
- can we avoid having both a PayloadQueuer and a MessageQueuer and only have one queue and one thing that encrypts/decrypts against the queue, or is that hiding here somewhere and I don't see it?
There is one queue. The raw byte (Payload) queue OutboundQueue
is owned by the PeerManager
.
The PeerManager
only interacts with MessageQueuer
since it only needs to send post-NOISE Messages
The Transport
only interacts with PayloadQueuer
since it just deals with raw bytes. The unencrypted ACT bytes and encrypted Messages.
During pre-NOISE, the Transport
layer takes in a PayloadQueuer
during process_input()
where it will enqueue the unencrypted ACT bytes.
During post-NOISE, the PeerManager
asks the Transport
layer to encrypt bytes via enqueue_message()
which is essentially: take this Message, encrypt it, and add it to the queue.
This separation allows the PeerManager
to own the queue and give it to the Transport
functions that need to enqueue onto it.
So, in the end, it matches with what you want... I think. There is one queue OutboundQueue
and Transport
encrypts/decrypts against it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I think its possible with more documentation and trait renaming to make it clearer what's going on, maybe without ripping out all the test traits, though the ratio of traits-to-code seems difficult to swallow. eg it may make sense to reduce the trait count and take the complexity hit by adding more shims in the test side of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done: review: Collapse OutboundQueue/Transport traits
5c346e0
to
b3c5dc2
Compare
@TheBlueMatt Thanks for taking a quick look at this. Some of the points you bring up are a little easier to talk about with concrete code. So, I am going to write a patch that uses less traits so that we can talk about the tradeoffs and see which is the right way to go moving forward. At the time of development, I was hopeful that I would be able to continue working on LDK and really show how this type of development style can be leveraged for better testing and new features. Now that that isn't the case, I think it is more important to get it into a state that is easier for the core team to maintain while still getting a lot of the benefits from test doubles. I'll push a new version in the next day or so when I find some time and we can go from there. |
b3c5dc2
to
019ea6e
Compare
This will make future changes that stabilize this work easier to review and merge. Co-authored-by: Arik Sosman <[email protected]> Co-authored-by: Julian Knutsen <[email protected]>
Lots of build warnings still, but it builds and tests pass.
This patch ontinues to separate state the exists before NOISE is complete and after it is complete to unlock future refactoring. Most callers immediately unwrapped the value from Peer and can just call Transport::get_their_node_id(). The duplicate connection disconnect path has been rewritten to determine whether or not to remove & send a disconnect event without needing to use a None value for Option<PublicKey> All other users are in contexts where they either exit early or continue if !transport.is_connected() so it is also safe to call Transport::get_their_node_id()
Use the newly abstracted Transport layer to write unit tests for most of the PeerManager functionality. This uses dependency inversion with the PeerManager to pass in a Transport test double that is configured to exercise the interesting code paths. To enable this, a new struct PeerManagerImpl has been created that has all of the original functionality of PeerManager, but includes a type parameter for the Transport implementation that should be used. To keep this test-feature hidden from public docs, the PeerManager struct is now just a shim that delegates all the calls to PeerManagerImpl. This provides a nice balance of a clean public interface with the features needed to create isolated tests. The tests make use of the Spy and Stub test patterns to control input state and validate the output state.
Pass in all the in/out parameters individually instead of taking everything out of peer. The important thing to notice here is that the pending_outbound_buffer parameter MUST be an OutboundQueue and CANNOT be an 'impl PayloadQueuer'. This demonstrates that the read_event() function has the ability to flush() since it can use the SocketDescriptorFlusher interface. This should be illegal due to the invariant that read_event() never calls back into the SocketDescriptor. This is one of the motivations of using trait bounds in argument position as a defensive programing technique to prohibit functions from taking actions that shouldn't be allowed even if they have access to the object. Future patches will fix this bug and ensure the type system enforces this bug from happening again. This is a short-lived destructure that will be recombined once the dependencies are cleaned up.
Per the external documentation, there is an invariant that read_event() will not call back into the SocketDescriptor, but the actual code flushed the outbound queue. Fix the bug and use the type system to guarantee this behavior in the future. By changing the function parameters to use `impl PayloadQueuer` the function will only have access to functions on that interface throughout the execution, but still be able to pass the OutboundQueue object since it satisfies the trait bounds. Functions such as enqueue_message() also take a `impl PayloadQueuer` so they can be called from that context, but functions that need the SocketDescriptorFlusher interface will fail at compile-time and point to the issue before any tests need to be run. This also fixes up tests and the tokio implementation that did not implement the proper behavior and relied on read_event() calling send_data().
Introduce the MessageQueuer interface used by Transport to queue messages for send and use it in all the existing places where messages are queued.
Improves readability during review.
Many of the is_connected() checks were duplicative due to the macro continuing out of the match statement if the peer referenced by the node_id had not seen an Init message yet. Remove the macro in favor of a function on PeerHolder that will return an Option<> and use it instead. Split out a new helper function Peer::is_initialized() that will return true if the peer has seen an Init message.
Encapsulate the Peer information that is only valid after an Init message has been seen. This makes the accesses to sync_status, ping information, and their_features cleaner w.r.t. the Peer initialization state.
The previous implementation would send a peer_disconnect callback for any error after the NOISE handshake completed, but only send a peer_connected callback after the Init message was received. This could lead to a dangling peer_disconnected callback that didn't have a paired peer_connected. This patch cleans up the state differentiation to match the following cases: unconnected (!transport.is_connected() && peer.post_init_info.is_none()) * Know about the peer, but are still in the process of the NOISE handshake connected (transport.is_connected() && peer.post_init_info.is_none()) * NOISE handshake completed, but haven't received an Init message initialized (transport.is_connected() && peer.post_init_info.is_some()) * The NOISE handshake has completed and the Init message has been received and processed With the 3 conceptual states, the read_event() path now only inserts a Peer into the node_id_to_descriptor map after a Peer enters the initialized state. This fixes the asymmetry between peer_disconnected & peer_connected as well as simplifies the disconnect code.
Now that the Init handling has been moved and the dependencies are more clear, deduplicate the parameters of the read_event() and helper functions to make use of Peer. The overall pattern remains the same, read_event() does the locking and passes in the separate items (peer, peers_needing_send, node_id_to_descriptor) to do_read_event(). The handle_message() path is also cleaned up now that post_init_state is guaranteed to be valid at that point.
Now that the locking and destructuring is done in read_event(), the workarounds for the pre-NLL borrow checker can go away.
This patch expands the PeerHolder API allowing for encapsulation of the peer state from the code that needs to iterate over initialized peers. This cleans up peer iteration/removal allowing for a much simpler design. 1) Introduce new APIs for PeerHolder: * initialized_peers_mut() * initialized_peer_node_ids() * remove_peer_by_descriptor() 2) Clean up event handler iteration using new APIs 3) Unify the timer_tick and DisconnectPeer event disconnect path using new APIs 4) Convert get_peer_node_ids() to use new API
The broadcast event handling code did not catch Ok(false) returned from the route handler when deciding whether or not to broadcast messages. Instead, it only checked if the return value was an Error. Fix it up and enable the regression tests.
Use a separate lock to generate the SecretKey instead of overloading the PeerHolder mutex. Refactoring PeerManager removed this hidden constraint and this should make it more robust in the future.
s/fill_message_queue_with_sync/fill_outbound_queue_with_sync/ s/queue_init_message/enqueue_init_message/
Motivated by lightningdevkit#456, remove the peers_needing_send set in favor of just scanning the peers during process_events() and attempting to send data for those peers that have items in their outbound queue or pending sync items to be sent out.
Small review items that are noncontroversial and can be rolled into a single commit.
Catch a duplication connection from a peer with the same node_id at the completion of the NOISE handshake instead of the processing of the INIT message.
* Remove local/global naming when processing INIT message contents * Remove duplicate requires_unknown_bits() check
Reduce trait complexity in the OutboundQueue and Transport objects. Splitting the objects into separate traits allowed for cleaner separation of responsibilities. But in practice, this led to complexity in the PeerManager function signatures and confusion in why they needed to be separated. To move this module to a more maintainable state for the core development team, collapse PayloadQueuer/SocketDescriptorFlusher into a single IOutboundQueue trait. Also, remove MessageQueuer in favor of just passing in the entire Transport object. This makes the code look more similar to the rest of the codebase while still leveraging traits in the OutboundQueue and Transport layer that allow for test doubles and real unit tests.
019ea6e
to
bf76e7c
Compare
Rebased against upstream and addressed @TheBlueMatt's feedback in I think that review patch is a good compromise between using traits for test doubles in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a bit clearer than it was, though I think there's still a lot of opportunity to reduce scoping of various structs and decouple objects/traits which are very tighly coupled right now.
@@ -60,6 +60,9 @@ pub(super) trait ITransport { | |||
/// Returns true if the connection is established and encrypted messages can be sent. | |||
fn is_connected(&self) -> bool; | |||
|
|||
/// Returns all Messages that have been received and can be parsed by the Transport | |||
fn drain_messages<L: Deref>(&mut self, logger: L) -> Result<Vec<Message>, PeerHandleError> where L::Target: Logger; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like Decryptor
already implements Iterator, so I presume this would be doable?
use std::ops::Deref; | ||
|
||
/// Interface used by Transport to interact with a handshake object | ||
pub trait IPeerHandshake { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This probably needs a rename ("I"-prefix seems to me to imply internal?) if its going to be public, as well as likely a much more thorough set of docs to explain what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I didn't find an existing pattern for interface traits in the codebase. In previous jobs, interfaces had "I" prefixes/suffixes so the types were identifiable. Let me know what you would prefer and I can do the rename.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, right, I think there's less chance of confusion in Rust since things either need to be explicitly generic across a trait or (at least once we upgrade to newer rust) carry the dyn
identifier. I'm not sure a prefix is required, I obviously misread the intent. In any case, does this need to be pub, or should it be pub only in fuzztarget mode/for testing? I'm not sure what the specific use-case is for the API, maybe @arik-so had one?
This feedback isn't very actionable. There is always more cleanup that can get done. The question is whether or not this is a large enough improvement from the current main or if additional PRs need to be added to get it into a more final state that you are happy merging. Personally, I think this is a huge improvement over the preexisting code and the additional tests and bugfixes are worth merging. But, I will leave that up to the Square team to decide. |
Right, I generally use the top-level comment to summarize my thinking (in this case across several PRs, not just this one), and inline comments to note specific action-items. |
The author decided to move on some time ago, and redoing the great work here would likely require a substantial effort to rebase and address the regressions that were being worked on in the later PRs in this set when the author moved on. Going to close. |
PRs:
(1/4) Enum Dispatch NOISE State Machine & Unit Tests
(2/4) PeerManager Unit Tests and Refactoring
(3/4) Conduit fixups from 494
(4/4) Update OutboundQueue flush behavior and fix existing bugs that it found
Motivation
Bug Fixes
Interface Usage
Refactor & Unit Tests
Next Action: Reviewers
History
TL;DR
Motivation
The original #494 highlighted some design issues in the way the PeerManager interacted with the Conduit and PeerState objects. In order to address the design issues and perform refactoring, the module needed to have a more complete set of tests. This PR focuses on the initial work needed to unit test PeerManager (creating the Transport abstraction), the unit tests themselves, and the resulting bugfixes and refactoring that could be performed safely with the security of the tests.
Bug Fixes
The testing and code audit uncovered a few bugs that were fixed in the following commits:
fix: Fix reentrancy bug in read_event()
fix: Don't allow peer_disconnect before peer_connected callbacks
fix: Handle Ok(false) from route_handler callbacks in event path
fix: Use a separate lock for the ephemeral_key_midstate
Interface Usage
Interfaces are used in this PR in a few different ways.
Using trait-based test doubles can make tests much easier to write and read by abstracting complex objects such as the
OutboundQueue
and replacing it with a simpleVec<Vec<u8>>
in the tests that only care about pushing data and validating the contents. Here is an example fromtransport.rs
that would otherwise have to instantiate anOutboundQueue
for all the tests leading to more complex interactions and a larger system under test than necessary.Testing and Refactoring
In order to test the PeerManager in an isolated way, the major dependency of the Transport layer needed to be abstracted so it could be mocked. The first set of patches deal with moving the relevant code to a new testable module that can be mocked out for the PeerManager unit tests.
The sheer number of tests required to test the PeerManager is a strong indication it should be broken down into even smaller testable objects to reduce the scope of each test. This PR is a huge step forward, but there is still plenty of work to make it more maintainable.
Refactorings were done alongside the bug fixes to make the code more maintainable. Overall, the design is much cleaner, and future features can be implemented with the safety of regression tests.
Testing Style Overview
Rust doesn't have a great story for dependency injection or mocking like more mature languages, so the patterns are a bit heavy-handed. The general idea is that you define a trait for each interface (dependency) the object under test needs to do its job. In the production code, it will use the full-featured "real" code, but in tests, these can be swapped out for test structures that enable more granular testing. There are generally two ways to do this that are both used in this PR.
Static Injection
This is the mocking used in the Transport tests. It is good for small interfaces where the number of cases is very small and requires little configuration. The entire type is swapped out in tests for a test double that has pre-defined behavior.
Dependency Inversion w/ Generics
This is the variant used in the PeerManager tests. It is better for cases where you may need to alter the test double during the test or there are so many initial configuration cases that creating static structs and swapping them in completely doesn't make sense. In the PeerManager case, this is done through dependency inversion. The PeerManager takes in a Transport object during the initialization calls (new_inbound/new_outbound) and uses it. Generics are leveraged to avoid trait objects and default type parameters are used to clean up non-test callers.
The tests configure a Transport test double in a specific state and pass it in to test the intended behavior.